Jane C. Blake

Editor 

The second issue of the Digital Technical Journal 
(March 1986) featured papers on the then 
recently announced MicroVAX II system, a system
based on a single-chip VAX implementation. In
this seventh issue, we present papers on the sec- 
ond generation of that chip set, CVAX, the two 
new systems that take advantage of its increased 
performance capabilities, and a new version of 
the VAX/VMS operating system for symmetric 
multiprocessing. 

The new mid-range system based on the CVAX 
chip set is the VAX 6200 family of computers, 
which utilizes a multiprocessing architecture. 
The first of two papers by Brian Allison is an 
overview of this highly configurable, expandable 
system. Brian's second paper offers insights into 
the architectural definition process for the 6200. 

One of the major decisions made by the 6200 
engineers was to design a new interconnect to 
support the multiprocessor system. Rick Gillett 
presents an informative discussion of the com- 
plexities involved in interfacing a microprocessor 
to a high-speed, multiprocessing bus. 

To ensure the availability of first-pass func- 
tional parts, a design verification team of engi- 
neers worked in parallel with the 6200 module 
designers. Jean Basmaji, Glenn Garvey, Masood 
Heydari, and Art Singer discuss the computer-aided
engineering and verification principles the
team instituted for the project. 

Rod Gamache and Kathy Morse then describe 
the major features of symmetric multiprocessing 
in the VAX/VMS operating system. Of particular 
interest is their description of a new synchronization
method implemented in VAX/VMS version 5.0.



In the last paper related to the VAX 6200 system, 
Bhagyam Moses and Karen DeGregory describe the 
development of workloads to measure VAX 6240 
performance. As part of their discussion, they 
include performance measurements and analysis.

The second new system based on the CVAX 
chip set is the low-end MicroVAX 3500/3600
system, which offers three times the performance
of its predecessor, the MicroVAX II. In his overview
of the major sections of the processor module,
Gary Lidington relates how schedule and
performance requirements influenced product 
design decisions. 

Charles DeVane then describes the MicroVAX 
3500/3600 system's two-level cache architecture,
with emphasis on the design of the second level 
cache. He also presents some cache performance 
test results. 

The high performance of both the VAX 6200 
family and the MicroVAX 3500/3600 system is 
attributable in great measure to the CMOS VAX 
family of chips on which these systems are based. 
Our five final papers address the design and 
development of this chip set. Frank Fox, Paul 
Gronowski, Anil Jain, Mike Leary, and Dan Miner 
begin the discussion with an explanation of how 
designers achieved the performance goals for the 
single-chip VAX CPU by reducing ticks per 
instruction and machine cycle time. 

A companion to the CVAX CPU, the floating 
point processor chip offers floating point perfor- 
mance equal to that of the microprocessor for 
integer operations. The approach taken to attain 
this goal and a description of the chip are presented
by Ed McLellan, Gil Wolrich, and Bob
Yodlowski. 

Jeff Winston then discusses the development of 
the system support chip, which provides a com- 
mon core of peripheral system functions. 

Next, Barry Maskas relates the design efforts of 
three groups, one in Japan and two in the U.S., 
that resulted in a single-chip interface between 
the CVAX microprocessor and the Q22-bus I/O 
subsystem.

In our final paper, Dave Morgan describes the 
CVAX memory controller chip, CMCTL, which is
optimized for Q-bus-based systems.
Foreword




Robert M. Supnik 

Corporate Consultant, 
VLSI Technology, and 
Group Manager, 
Semiconductor Engineering 
Microprocessor Development 

In May 1985, Digital introduced the MicroVAX II 
computer system. Based on the MicroVAX proces- 
sor chip set, the MicroVAX II system offered 
unsurpassed price, performance, and reliability 
characteristics. In the three years since then, 
Digital has sold more than 100,000 systems 
based on the MicroVAX chip set. There are more 
MicroVAX-based systems in the field than all
other types of VAX systems combined.

In the same three years, the practice of com- 
puter engineering has advanced considerably. 
Faster processors, bigger memories, quieter pack- 
ages, and more complex software have appeared 
in a steady stream. For Digital to remain compe- 
titive, we would need, over time, a second gener- 
ation of VLSI-based VAX chips and systems. The 
chips and systems that constitute the second 
VLSI-based generation are described in this issue 
of the Digital Technical Journal.

The planning for the second generation began 
in 1983. That year, the LSI Group (now Semicon- 
ductor Operations) formulated a multiyear pro- 
gram for the development of both semiconductor 
process technology and leading-edge chip prod- 
ucts. The key characteristics of this process/ 
product plan were 

■ CMOS (complementary metal-oxide-semicon- 
ductor) process technology (Previous Digital 
chips were based on NMOS technology.) 



■ Multiple process generations related by opti- 
cal scaling 

■ VAX microprocessors as the leading edge chip 
development projects 

■ Performance improvements targeted for greater
than 50 percent per year

This program not only provided the LSI Group 
with an overall structure for its process and chip 
development projects; it also provided Digital's 
system groups with a stable, long-term basis for 
planning system products.

The program was also a significant leap of 
faith. When it was formulated, there was no 
MicroVAX business. The MicroVAX II system was
two years away from shipment. Almost all design
resources in the LSI Group and in the low-end system
groups were busy with the MicroVAX chip
set and its related systems. Major development 
projects in technology, chip design, systems 
design, and manufacturing were required to 
bring the program vision to fruition. 

Work began with development of the underlying
semiconductor technology. Starting in
1983, a team from Semiconductor Manufactur- 
ing's Advanced Semiconductor Development 
(ASD) defined, simulated, and tested CMOS-1, 
Digital's first CMOS process. When first defined, 
CMOS-1's key features (N-well base on a p-type
epitaxial layer, two levels of metal interconnect,
2.0-micron feature size, direct scalability to
1.5-micron feature sizes) were controversial
within an industry that was still debating NMOS 
versus CMOS. Over time, these choices have been 
vindicated, and CMOS-1 has proven to be a main- 
stream, robust, highly manufacturable process. 

Equally important was development of design 
methods for larger and more complex chips. The 
Semiconductor Engineering Computer Aided 
Design (CAD) Group continuously refined the 
structured design process first deployed for 
MicroVAX and V-11. The goals of this effort
were improved simulation coverage, faster 
turnaround time, and more extensive automated 
verification. One consequence of the increased 
use of CAD tools was a dramatic increase in the 
amount of computing power required. This gen- 
eration of chip development projects used four 
times as much computing power as the first VLSI
generation.

The Semiconductor Engineering Microproces- 
sor Group began architectural prework on the 
second-generation chip set (called CVAX) in 
mid-1984. The overarching goal was simple:
three times the performance of the MicroVAX 
chip set in less than three years — a compound 
performance growth rate of more than 50 per- 
cent per year. The central processor design 
started from the MicroVAX base but drew upon 
ideas from other VAX implementations, notably 
the 8700. The floating point unit design focused 
on minimal execution flows for the most common 
instructions. Both chips transitioned to imple- 
mentation in 1985. 
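
The arithmetic behind the CVAX goal can be checked with a short calculation. The sketch below is illustrative only; the function names are ours, and it assumes simple annual compounding. It shows that a threefold gain over exactly three years corresponds to roughly 44 percent per year, so the "more than 50 percent" figure relies on delivering in noticeably less than three years.

```python
import math


def annual_growth(total_gain: float, years: float) -> float:
    """Compound annual growth rate that yields total_gain after `years`."""
    return total_gain ** (1.0 / years) - 1.0


def years_for_rate(total_gain: float, rate: float) -> float:
    """Time needed to reach total_gain at a fixed annual growth rate."""
    return math.log(total_gain) / math.log(1.0 + rate)


# Threefold performance in exactly three years: about 0.44 (44 percent per year).
rate_at_three_years = annual_growth(3.0, 3.0)

# Time in which a threefold gain equals 50 percent per year: about 2.71 years.
break_even_years = years_for_rate(3.0, 0.50)

print(f"{rate_at_three_years:.3f} per year, break-even at {break_even_years:.2f} years")
```

Under these assumptions, any schedule shorter than about 2.7 years satisfies both statements of the goal.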

The original concept for the CVAX chip set 
had been to build chip-for-chip analogues of 
MicroVAX — a central processor and a floating 
point unit. However, as the flexibility of the new 
CMOS process, and the efficiency of the CAD 
tools, were appreciated by designers, the chip set 
concept expanded beyond the central processor 
to include key peripherals. The implementation 
of these peripheral functions in VLSI chips made 
systems faster, more reliable, and less expensive. 
In addition, it allowed peripheral functions to be 
standardized across multiple system implementa- 
tions and additional functions to be added in 
modular fashion. The Semiconductor Engineer- 
ing peripherals group (now Advanced Develop- 
ment) specified and implemented a memory con- 
troller, a memory driver, a console interface, and 
a Q-bus interface. 

After the MicroVAX II system shipped in May 
1985, the Low-end Systems Group and the
Mid-range Systems Group became actively involved
in the specification of the CVAX chips and in 
the definition of new systems utilizing the chip 
set. In the low end, the 3500/3600 systems 
were defined as evolutionary extensions of the
MicroVAX II. Nonetheless, the performance
targets for the new chips posed knotty design
problems for a system family bounded by both 
cost and packaging considerations. 

In the mid-range, the system designers wished 
to exploit the CVAX chip set's combination of 
high performance and low cost by constructing
an extensible multiprocessor system. They
defined a new system interconnect (supported by 
unique chips) to provide unprecedented flexi- 
bility and extensibility in configuring systems, 
and new system packaging to support the concept.
However, a general-purpose multiprocessor
system was feasible only if the VMS operating sys- 
tem could take advantage of the incremental 
power offered by additional processors. This 
required a major restructuring of VMS to support 
symmetric (all processors equal) multiproces- 
sing. Thus, the definition and implementation of 
the mid-range 6200 system family and of VMS 
symmetric multiprocessing support had to be 
closely linked. 

As the engineering development projects pro- 
gressed, manufacturing became heavily involved 
in planning and executing the transition from 
design to volume product. LSI Manufacturing in 
Hudson, Massachusetts, introduced CMOS-1 into
multiple fabrication units in order to produce
prototypes quickly and to ramp up to high-volume
production. System manufacturing groups
in Westfield (Massachusetts), Albuquerque (New
Mexico), Puerto Rico, and other sites worked
closely with the system designers to introduce
the new manufacturing processes required for
system production.

The result of these development programs is a
family of VAX systems with exemplary price, performance,
and reliability characteristics. Moreover,
the programs leave as residuals a set of
VLSI components from which other products can
be built, and base technology from which further
advances in chip and system design will evolve.
The initial program vision has been fulfilled,
even exceeded. Many people, in teams and individually,
worked together to bring this about.
The excellence of the results reflects, in full
measure, the excellence of the work that they
have done.
Brian R. Allison



An Overview of the 

VAX 6200 Family of Systems 

Digital's VAX 6200 series is a high-performance, expandable family of 
computer systems that combines low-cost microprocessors with high- 
performance memory and I/O subsystems. Based on the CMOS VAX chip 
set, the VAX 6200 CPU module performs at 2.8 times the VAX-11/780 system;
utilizing a multiprocessing architecture, system speeds are available up to
11 times the VAX-11/780 system. The memory subsystem utilizes a multi-
controller architecture for up to 256MB of total system memory. The XMI
bus, the electrical interconnect for the system, supports the multiple pro- 
cessors, memory subsystems, and VAXBI channel adapters. The VAXBI is 
used for all I/O devices. 



The VAX 6200 family of computer systems is the 
most recent addition to Digital's line of VAX computer
systems. The VAX 6200 systems, primarily
based on CMOS technology, are mid-range systems
that exploit multiprocessing techniques.
The VAX 6200 family currently comprises four
systems, all built from common subassemblies.
Any VAX 6200 system may be upgraded to any
other VAX 6200 system simply by adding CPU
and memory modules to the existing cabinet.
This paper provides an overview of the system 
and therefore a context for the five papers that 
follow in this issue. These papers describe sev- 
eral of the components in detail, the engineering 
design effort, the performance evaluation pro- 
cess, and some of the multiprocessing aspects of 
the operating system. 

In the past, CMOS-based microprocessor technology
has been used primarily to build low-cost
systems. Today, by using multiples of these low-cost
microprocessors, we are presented with a unique
opportunity to produce a high-performance computer
system when the microprocessors are coupled
with high-performance memory and I/O
subsystems. Although this type of system architecture
will not directly result in faster execution
of a single task, it does result in greater system
throughput in applications that have several
simultaneously computable tasks. The architecture
couples the effectiveness of the VMS operating
system in multiprogrammed environments
with hardware optimized for efficient multiprocessor
operation. The result is a system that offers
similar performance for a large class of applications
at a better price-performance ratio than that
offered by traditional single-processor, high-performance
computer systems.

A primary objective of the VAX 6200 system 
design is to provide a highly configurable and 
expandable computing environment. To achieve 
this objective, designers chose a modular sub- 
assembly design for the total system. This modu- 
lar design provides for cost-effective basic sys- 
tems and also allows for system expansion to 
achieve higher performance. All members of the 
VAX 6200 family are housed in the same cabinet
and use the same basic subassemblies. The only
differences are the number of processors, amount of
memory, and number of I/O devices. Table 1
details the configurations of the VAX 6210,
VAX 6220, VAX 6230, and VAX 6240 systems.

System Architecture

All VAX 6200 systems consist of CPU(s), mem- 
ory, and I/O channel adapters connected to a 
common system interconnect known as the XMI. 
The VAXBI is used as the interconnect to all I/O 
devices in the system. 1 All memory and I/O 
devices are equally accessible by all CPUs in the 
system. Figure 1 shows a block-level diagram of 
the VAX 6200 system.
Table 1 VAX 6200 Family System Configurations

                            VAX 6210    VAX 6220    VAX 6230    VAX 6240

Number of processors        1           2           3           4
Main memory                 32MB        64MB        64MB        128MB
VAXBI channels              2           2           2           2
CPU cycle time              80 ns       80 ns       80 ns       80 ns
Cache size (per CPU)        1KB/256KB   1KB/256KB   1KB/256KB   1KB/256KB
Free XMI slots              10          8           7           4
Performance (times one
  VAX-11/780 system)        2.8         5.5         8.3         11.0
Maximum CPUs                4           4           4           4
Maximum memory              256MB       256MB       256MB       256MB
Maximum VAXBI channels      6           6           6           6
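
The Table 1 figures can be cross-checked with a small calculation. The sketch below is not from the original paper: it assumes that each CPU module, each 32MB MS62A memory module, and each VAXBI channel adapter (the DWMBA/A) occupies one slot of the 14-slot XMI backplane, and the names used are ours. Under that assumption, the quoted free-slot counts follow directly, and the quoted performance stays within about 2 percent of linear scaling from a single 2.8x CPU.

```python
XMI_SLOTS = 14    # slots in the XMI backplane
MS62A_MB = 32     # memory per MS62A module
CPU_PERF = 2.8    # one KA62A relative to a VAX-11/780

# model: (processors, main memory in MB, VAXBI channels,
#         quoted performance, quoted free XMI slots), copied from Table 1
TABLE_1 = {
    "VAX 6210": (1, 32, 2, 2.8, 10),
    "VAX 6220": (2, 64, 2, 5.5, 8),
    "VAX 6230": (3, 64, 2, 8.3, 7),
    "VAX 6240": (4, 128, 2, 11.0, 4),
}


def free_xmi_slots(cpus: int, memory_mb: int, vaxbi_channels: int) -> int:
    """Slots left after CPUs, memory modules, and channel adapters.

    Assumes one XMI slot per module (our assumption, not stated in the text).
    """
    memory_modules = memory_mb // MS62A_MB
    return XMI_SLOTS - cpus - memory_modules - vaxbi_channels


def scaling_efficiency(cpus: int, performance: float) -> float:
    """Quoted performance as a fraction of perfect linear scaling."""
    return performance / (cpus * CPU_PERF)


for model, (cpus, mem, channels, perf, quoted_free) in TABLE_1.items():
    print(model, free_xmi_slots(cpus, mem, channels), quoted_free,
          round(scaling_efficiency(cpus, perf), 3))
```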



Figure 1 VAX 6200 System Block Diagram
(Diagram labels: up to four CPUs, for up to 11 times VAX-11/780
performance, and up to 256MB of memory on the 100MB/second XMI;
up to six VAXBI channel adapters serving VAXBI 1 through VAXBI 6,
with devices including KDB50, DEBNA, CIBCA, DRB32, and DEBNK;
optional VAXBI expander cabinet.)
The primary goal of the VAX 6200 system is
to allow higher levels of system performance
through multiprocessing. To simplify software
design and to be consistent with previous multi- 
processor architecture, it was essential to pro- 
vide a shared memory resource. All system mem- 
ory is a global resource accessible through the 
same address space from each processor and from 
all I/O devices. A sophisticated multilevel cache 
contained locally in each CPU minimizes mem- 
ory accesses on the XMI. Cache coherency is 
maintained totally by hardware. 

Technology 

The VAX 6200 systems are based on a number of 
different CMOS technologies. The VAX CPU chip
set and the system interconnect transceivers are
implemented entirely in Digital's full custom
CMOS process featuring a size of 1.5 microns. 2

The interface between each module and the 
system interconnect is implemented in channelless
1.5-micron CMOS gate arrays. The number of
gates used in these arrays varies from 18K to 50K
gates. The interface to the VAXBI and the XMI
arbitration system is implemented in 1.5-micron
channeled arrays. The on-board CPU caches are
implemented with 45-nanosecond (ns) 64K-by-4
CMOS static random-access memories (SRAMs) 
and industry-standard CMOS cache tag chips. 

All VAX 6200 XMI and VAXBI modules are 
connected to their respective backplanes by a 
300-pin zero insertion force (ZIF) connector. All 
modules use 10-layer controlled impedance 
printed circuit boards. All cables from the modules
are connected through the backplane to
improve reliability and to minimize the task of 
replacing modules. 

The VAX 6200 XMI backplane is a 14-layer 
controlled impedance printed circuit board. Side 
1 consists entirely of surface-mount contacts for 
the ZIF connector. Side 2 consists of plated 
through holes for power strips and I/O pins, and 
surface-mount pads for resistors. These surface- 
mount resistors form the termination network for 
the XMI signal lines. 

VAX 6200 XMI modules use a printed circuit 
board very similar to the VAXBI printed circuit 
board. XMI modules have the same finger 
pin design as the VAXBI, but the module size is 
28 cm (11.025 inches) deep instead of
20.38 cm (8.025 inches) deep.

The VAX 6200 modules make use of advanced
module technology features to maximize both
the number of I/Os available to VLSI chips and
the amount of logic that can be put on a module.
Surface-mounted components are used extensively
throughout the system. Further, many passive
components and a limited number of active
surface-mounted components reside on side 2 of
the modules. All VAX 6200 modules limit the use
of surface mount to 50-mil lead pitch components
with vias on 100-mil centers. Across the
modules in the system, there is a mixture of small
outline integrated circuit (SOIC), plastic leaded
chip carrier (PLCC), and cerquad surface-mount
packages.

All VAX 6200 XMI modules interface to the 
XMI through a set of eight semicustom parts. 
These eight chips are physically mounted on a 
section of the module known as the "XMI cor- 
ner." This section of the module is approximately 
12.7 cm (5 inches) by 3 cm (1.2 inches) and is 
located by the A, B, and C connectors of the mod- 
ule. (See Figure 2.) The XMI interface area is 
identical on all modules so that a common elec- 
trical load is presented to all slots on the XMI. 
The XMI corner has four 44-pin cerquad pack- 
ages on side 1 of the module and four 44-pin 
cerquad packages on side 2. In addition, approxi- 
mately 100 surface-mounted-device (SMD) sig- 
nal termination resistors and bulk power capaci- 
tors are divided evenly across both sides of the 
module in the XMI corner. 

Figure 2 is a photograph of the three VAX 6200 
XMI modules. Note that all three modules have 
the identical components in the lower right cor- 
ner and a similar gate array directly above the 
XMI corner. 

VAX 6200 CPU Module 

As noted earlier, the VAX 6200 CPU (KA62A) is 
based on the CMOS VAX chip known as the CVAX.
The KA62A is a single module that implements 
a full CPU subsystem. Included on the KA62A 
module are 

■ The CVAX chip, which includes a 1 kilobyte 
(KB) on-chip cache 

■ An external 256KB cache 

■ A floating point accelerator chip (CFPA) 

■ Console support hardware 

■ An interface to the XMI 

Figure 3 shows a block diagram of the KA62A 
module. 
Figure 2 Three VAX 6200 XMI Modules

Figure 3 VAX 6200 CPU Module (KA62A) Block Diagram

Using the CVAX processor with an 80-ns cycle
time, the KA62A CPU module performance is
approximately 2.8 times that of the VAX-11/780
system. For a total system performance up to
11 times greater than the VAX-11/780, up to four
KA62A CPU modules may be configured in a
VAX 6200 system.

The KA62A CPU module contains a two-level 
cache to reduce memory access time. The pri- 
mary cache is 1KB in size and resides inside the 
CVAX chip. This cache contains only instruction 
data to eliminate the need to invalidate this data 
as other processors write to cached data locations.
(The VAX architecture provides strict rules
for modification of instruction type data.) The 
secondary cache is 256KB in size and contains 
data as well as instructions. The KA62A monitors 
write transactions on the system interconnect and 
invalidates any cached locations written by 
another CPU or I/O device. 
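
The invalidate-on-write behavior described above can be illustrated with a toy model. The sketch below is a simplification under stated assumptions, not the KA62A implementation: the class and method names are ours, the cache is modeled as a plain dictionary with no tags or block structure, and writes go straight through to memory, which matches the later statement that processor caches must force all written data back to main memory.

```python
class SnoopingCache:
    """Toy write-through cache that invalidates on other nodes' writes."""

    def __init__(self, memory: dict):
        self.memory = memory    # shared main memory: address -> value
        self.lines = {}         # locally cached copies: address -> value

    def read(self, addr: int):
        # A miss fills the cache from main memory; a hit avoids the bus.
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr: int, value, other_caches) -> None:
        # Write-through: main memory always receives the new data.
        self.lines[addr] = value
        self.memory[addr] = value
        # Every other cache sees the write on the interconnect.
        for cache in other_caches:
            cache.snoop_write(addr)

    def snoop_write(self, addr: int) -> None:
        # Drop any stale copy so that the next read refetches from memory.
        self.lines.pop(addr, None)


memory = {0x100: 1}
cpu_a = SnoopingCache(memory)
cpu_b = SnoopingCache(memory)

first = cpu_a.read(0x100)            # CPU A caches the old value
cpu_b.write(0x100, 2, [cpu_a])       # CPU B writes; A's copy is invalidated
second = cpu_a.read(0x100)           # CPU A refetches the new value
```

Because stale copies are simply discarded, no software action is needed to keep the caches consistent, which is the property the text attributes to the hardware.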
Figure 4 VAX 6200 Memory Module (MS62A) Block Diagram



Memory 

The VAX 6200 memory subsystem is made up of 
memory controller/array modules and is known 
as the MS62A. The MS62A module, shown in 
Figure 4, contains a memory controller chip and 
32 megabytes (MB) of 1-megabit (Mb) dynamic
RAMs (DRAMs). The MS62A maintains a 64-bit 
data path between the memory controller chip 
and the RAMs, and implements an 8-bit error-cor- 
recting code (ECC) for each 64-bit word. The 
MS62A contains hardware to implement up to 16 
"lockable" memory locations per memory array. 
These memory locks are used extensively by pro- 
cessors and I/O devices to ensure singular access 
to data structures in a shared-memory multipro- 
cessor system. 
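
The choice of 8 check bits for each 64-bit word is consistent with a conventional single-error-correcting, double-error-detecting (SEC-DED) Hamming code. The paper does not name the code used by the MS62A, so the sketch below is an assumption offered only to show where the number 8 would come from in that case.

```python
def hamming_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits: int) -> int:
    """Add one overall parity bit for double-error detection."""
    return hamming_check_bits(data_bits) + 1


print(hamming_check_bits(64), secded_check_bits(64))
```

For 64 data bits this gives 7 bits for correction plus 1 for double-error detection, 8 in all.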

The greater memory bandwidth required by 
multiple processors and I/O channels is achieved 
by memory interleaving. The MS62A allows inter- 
leaving on 32-byte boundaries. As long as mem- 
ory addresses are randomly distributed across the 
lower 6 address bits, the bandwidth of the total 
memory subsystem can be increased linearly with 
the addition of interleaved memory controllers. 
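
A minimal sketch of interleaving on 32-byte boundaries follows. The exact address bits the MS62A uses for module selection are not given in the text, so the mapping below (consecutive 32-byte blocks assigned to consecutive modules) is an assumption, and the function name is ours.

```python
INTERLEAVE_BYTES = 32    # interleave granularity stated in the text


def module_for_address(addr: int, ways: int) -> int:
    """Memory module that services `addr` under `ways`-way interleaving."""
    return (addr // INTERLEAVE_BYTES) % ways


# Sixteen consecutive 32-byte blocks spread over a four-way interleaved set.
hits = [0, 0, 0, 0]
for block in range(16):
    hits[module_for_address(block * INTERLEAVE_BYTES, ways=4)] += 1
print(hits)
```

When accesses are spread evenly in this way, each module sees only a fraction of the traffic, which is why aggregate bandwidth can grow with the number of interleaved controllers.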

The MS62A memory modules may be inter- 
leaved two, four, or eight ways. The interleave 
factor is automatically determined by the system 
upon power-up or system initialization. How- 
ever, designers have given the user the ability to 
manually specify the interleave characteristics of 
the memory subsystem. Up to eight MS62A mem- 
ory modules may be configured in a VAX 6200 
system. 

I/O Channels

The VAX 6200 system uses the VAXBI bus as the 
interconnect for all I/O devices. The system
interface to the VAXBI is a two-module set called 
the DWMBA. Figure 5 shows a block diagram of 
the DWMBA modules. The DWMBA/A module is 
connected to the XMI, and the DWMBA/B module 
is connected to the VAXBI. These two modules
are interconnected with a 120-wire cable assem- 
bly which may be up to 4.6 meters (15 feet) 
long. 

The DWMBA allows VAXBI devices to read system
memory at up to 5.5MB per second and to
write system memory at up to 13.3MB per second.
Any VAXBI-compatible device may be connected
to the VAX 6200 systems through the
DWMBA. All VAX 6200 systems contain a minimum
of two VAXBI channels and may optionally
contain up to six VAXBI channels. 
System Interconnect, the XMI

The XMI, the primary electrical interconnect in
the VAX 6200 family of computer systems, 
encompasses 

■ The protocol observed by a node on the XMI 

■ The electrical environment of the XMI

■ The backplane 

■ The logic used to implement the protocol 

The XMI can support multiple processors, 
multiple memory subsystems, and multiple I/O 
channel adapters.

XMI nodes may be classified as commanders or
responders, depending on their role in a given
transaction. A commander is a node that is initiating
an XMI transaction. A responder is the node
that must act upon the transaction. A processor
node usually acts as a commander. (However, a
processor node may become a responder if
another node reads a control/status register on
the CPU.) Memory nodes, on the other hand, are
always responders since they cannot initiate an
XMI transaction. I/O nodes may act as either
commanders or responders, depending on the
type of I/O operation. The functions of these
nodes are further explained in sections below.

Because the XMI is a pended interconnect, several
transactions can be in progress simultaneously.
When an XMI commander initiates a
request for a read or to solicit an interrupt vector,
an identifier code is also transmitted to the
selected responder. At this point, control of the
XMI is relinquished, and other transactions are
allowed to take place while the responder fetches 
the requested read data or interrupt vector. The 
responder then arbitrates for control of the XMI 
and returns the requested data or vector along 
with the identifier code. By monitoring the 
identifier codes, the initial commander is able to 
receive the correct data and continue. 
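
The pended protocol can be pictured as a request phase and a separate, tagged reply phase. The following toy model is a sketch under our own assumptions (class and method names are hypothetical, replies are returned in request order, and arbitration is not modeled); it shows only how an identifier code lets two commanders have reads outstanding at the same time.

```python
from collections import deque


class PendedBus:
    """Toy model of a pended (split-transaction) interconnect."""

    def __init__(self, memory: dict):
        self.memory = memory          # address -> data held by the responder
        self.outstanding = deque()    # reads accepted but not yet answered
        self.delivered = {}           # commander ID -> list of (addr, data)

    def read_request(self, commander_id: int, addr: int) -> None:
        # Command cycle: the commander sends its ID with the address,
        # then releases the bus without waiting for the data.
        self.outstanding.append((commander_id, addr))

    def return_data(self) -> int:
        # Later, the responder takes the bus and returns the data
        # tagged with the ID of the commander that asked for it.
        commander_id, addr = self.outstanding.popleft()
        reply = (addr, self.memory[addr])
        self.delivered.setdefault(commander_id, []).append(reply)
        return commander_id


bus = PendedBus({0x100: "A", 0x200: "B"})
bus.read_request(commander_id=1, addr=0x100)
bus.read_request(commander_id=2, addr=0x200)   # issued before the first reply
bus.return_data()
bus.return_data()
print(bus.delivered)
```

Each commander only has to watch for its own identifier, so the bus stays free for other traffic while memory is being accessed.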

Arbitration and data transfers occur simulta- 
neously over a multiplexed set of address and 
data lines, and a separate set of arbitration lines. 
The XMI supports quadword, octaword, and hex- 
word reads to memory, as well as quadword and 
octaword memory writes. In addition, the XMI
supports longword-length read and write opera- 
tions to I/O space. These longword operations 
implement byte and word modes required by certain
I/O devices.



The XMI has 30 address bits, and the smallest 
addressable entity is a single byte. XMI address 
space is divided into two halves by bit 29 of 
the address. When bit 29 equals zero, an address 
is said to fall into memory space. When bit
29 equals one, the address is said to fall within
I/O space. This arrangement matches the maximum
physical address as defined by the VAX
architecture and allows up to 512MB of physical
memory to be addressed. The XMI architecturally
allows up to 16 nodes, but is physically and electrically
constrained to 14 nodes.
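
The address split described above amounts to testing a single bit. The sketch below uses names of our own choosing and simply restates the rule in code: with 30 address bits and bit 29 selecting I/O space, the memory half spans 2^29 bytes, or 512MB.

```python
ADDRESS_BITS = 30     # width of an XMI address
IO_SPACE_BIT = 29     # bit 29 set means I/O space


def is_io_space(addr: int) -> bool:
    """True if the XMI address falls in I/O space, False for memory space."""
    if not 0 <= addr < (1 << ADDRESS_BITS):
        raise ValueError("address does not fit in 30 bits")
    return bool((addr >> IO_SPACE_BIT) & 1)


MEMORY_SPACE_BYTES = 1 << IO_SPACE_BIT    # size of the memory half

print(MEMORY_SPACE_BYTES // (1024 * 1024), "MB of memory space")
print(is_io_space(0x1FFF_FFFF), is_io_space(0x2000_0000))
```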
Figure 5 VAX 6200 VAXBI Channel Adapter Block Diagram

The XMI multiplexes data and address information
onto the 64-bit data path. Data transactions
are initiated with a "command and address" 
cycle, followed by multiple data cycles. The max- 
imum length for an XMI transaction is 32 bytes of 
data. The XMI cycle time is 64 ns. The effective
bandwidth of the XMI is a function of the data
transfer size, as shown in Table 2. 
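
The Table 2 figures can be reproduced from the cycle time and data path width. The sketch below assumes that each transaction costs one command/address cycle plus one data cycle per 8 bytes (or part thereof), with no other overhead, and that MB means 10^6 bytes; these assumptions are ours, chosen because they match the published numbers.

```python
CYCLE_NS = 64           # XMI cycle time
DATA_PATH_BYTES = 8     # 64-bit multiplexed data path


def xmi_bandwidth_mb_per_s(transfer_bytes: int) -> float:
    """Effective bandwidth for one transaction size, in MB/second."""
    data_cycles = -(-transfer_bytes // DATA_PATH_BYTES)   # ceiling division
    total_cycles = 1 + data_cycles                        # plus command/address
    bytes_per_ns = transfer_bytes / (total_cycles * CYCLE_NS)
    return bytes_per_ns * 1000.0                          # bytes/ns -> MB/s


for size in (4, 8, 16, 32):
    print(size, round(xmi_bandwidth_mb_per_s(size), 2))
```

The fixed command/address cycle is amortized over more data as transactions grow, which is why the 32-byte hexword transfer reaches the full 100MB per second.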

The XMI architecture allows for three distinct 
classes of devices. 

Processor Nodes 

Each processor node contains a CPU that exe- 
cutes instructions and manipulates data con- 
tained in XMI memory. The processor node can 
execute any instruction set compatible with the 
VAX-style byte addressing and memory locking 
mechanisms. A processor node will have a cache
that must force all written data back to main
memory. Any cached processor module must also
monitor write traffic on the XMI and invalidate 
any location in its own cache that is written into 
main memory. Processor nodes must also be 
capable of responding to interrupt requests gen- 
erated either by other processors or by I/O 
nodes. 

I/O Nodes 

I/O nodes generally respond to I/O space refer- 
ences either by mapping the data onto another 
bus or by interpreting data as a command. An 
I/O node can also become a commander on the 
XMI and access global XMI memory. I/O nodes 
may generate interrupt sequences directed 
toward processor nodes. However, I/O nodes do 
not respond to commands directed toward mem- 
ory space. 

Memory Nodes 

Memory nodes act only as responders on the XMI. 
They respond to read and write requests directed 
toward memory address space. These requests are 
generated either by processor or I/O nodes. 

Table 2 XMI Bandwidth Based on Transaction Size

Transaction Size    Interconnect Bandwidth
in Bytes            in MB/second

 4                   31.25
 8                   62.50
16                   83.33
32                  100.00

Data Integrity

The XMI contains a number of features to
enhance the integrity and reliability of the
interconnect. First, all XMI information transfer
lines are parity protected, and XMI command
confirmation signals are ECC protected. The XMI
protocol is sufficiently robust to permit detection
and recovery of all single-bit error conditions on
these signals. Additionally, the XMI defines time-out
conditions that may be used to detect and
diagnose failures.
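
Parity protection of the kind mentioned above detects any single-bit error on a transfer. The sketch below is generic: the text does not say whether the XMI uses odd or even parity or how lines are grouped, so even parity over a whole 64-bit value is assumed purely for illustration.

```python
def parity_bit(value: int) -> int:
    """Even-parity bit: 1 if the value has an odd number of 1 bits."""
    return bin(value).count("1") % 2


def transfer_ok(received: int, sent_parity: int) -> bool:
    """Receiver's check: recompute parity and compare with the sent bit."""
    return parity_bit(received) == sent_parity


word = 0x1234_5678_9ABC_DEF0
sent_parity = parity_bit(word)

single_flip = word ^ (1 << 17)                # one corrupted line
double_flip = word ^ (1 << 17) ^ (1 << 40)    # two corrupted lines

print(transfer_ok(word, sent_parity),
      transfer_ok(single_flip, sent_parity),
      transfer_ok(double_flip, sent_parity))
```

A double-bit error passes a simple parity check, which is one reason the more critical confirmation signals carry ECC instead.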

VAX Console 

The VAX 6200 system implements the standard 
VAX console functionality by means of software 
that conditionally executes on each of the KA62A 
CPU modules. Each KA62A CPU module contains
a serial-line interface, 256KB of read-only memory
(ROM), 32KB of electronically erasable ROM
(EEROM), and 512 bytes of RAM. Control is 
passed to the console software upon any one of 
the following occurrences: 

■ System power-up 

■ Initialization 

■ Receipt of a control-P character from the con- 
sole terminal 

■ Execution of the HALT instruction 

■ Some severe error conditions 

Each KA62A CPU has access to console termi- 
nal transmit-and-receive lines carried on the sys- 
tem backplane. Upon power-up, control of the 
system console terminal is dynamically allocated 
to one of the CPUs present in the system. This 
CPU, known as the "boot" processor, provides 
the system interface to the console terminal as 
well as to the switches and lights located on the 
system control panel. 

On receiving commands from the console ter- 
minal, the boot processor may run diagnostics or 
boot an operating system. This processor communicates
with other processors by means of a structure
maintained in memory known as the console
communications area (CCA).
Also considered as part of the console subsystem,
a TK50 tape drive is included in each
VAX 6200 system. The tape drive is connected to
the system by means of a TBK50 controller mod- 
ule located on a VAXBI I/O channel and is used 
for the following purposes: 

■ Saving all volatile parameters for the console 
subsystem 

■ Loading the VAX Diagnostic Supervisor (VDS) 
when no disk is available or functional in the 
system 

■ Distributing operating system and layered soft- 
ware 

The TK50 tape drive is also available under oper- 
ating system control as a general-purpose data 
interchange device. 

Built-in Self-test 

Extensive built-in self-test is used by all modules 
contained within the VAX 6200 systems. Upon 
power-up, all modules within the system, with 
the exception of the DWMBA, perform a self-test 
in parallel. After self-test is complete, the CPU 
modules examine each other's status; the one in 
the lowest slot number that passed self-test is 
selected as the boot processor. The boot proces- 
sor then continues to execute an additional test 
to ensure memory accessibility and finally exe- 
cutes a test of the DWMBA. 
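
The boot processor selection rule reduces to choosing the lowest-numbered slot whose CPU passed self-test. The sketch below models only that rule; the function name and the dictionary of results are hypothetical, and the real modules exchange status in hardware rather than through a shared Python structure.

```python
from typing import Dict, Optional


def select_boot_processor(self_test_passed: Dict[int, bool]) -> Optional[int]:
    """Return the lowest XMI slot whose CPU passed self-test, if any."""
    passing = [slot for slot, ok in self_test_passed.items() if ok]
    return min(passing) if passing else None


# Four CPU modules; the one in slot 1 failed its self-test.
results = {1: False, 2: True, 3: True, 4: True}
boot_slot = select_boot_processor(results)
print(boot_slot)
```

A failed CPU in the lowest slot therefore does not prevent the system from booting, provided at least one other CPU passes.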

Physical Packaging 

All VAX 6200 systems are housed in the same 
cabinet, which is 78 cm (30.5 inches) wide by 
154 cm (60.5 inches) tall by 76 cm (30 inches)
deep. The cabinet contains one 14-slot XMI
backplane, two 6-slot VAXBI backplanes, and all
necessary power and cooling to sustain a wide range
of configurations. Figure 6 shows a VAX 6240
with the front door removed.

The XMI is physically implemented in a 
14-slot backplane assembly containing ZIF mod- 
ule connectors, signal terminating networks, and 
a centralized clock and arbitration system.
Modules are located on 2 cm (0.8 inch) centers. The
XMI backplane is supplied with +5 volts (V) for
general logic, a separate +5 V supply for memory,
±12 V for the console terminal line drivers,
and -5.2 V/-2 V for emitter-coupled logic
(ECL). Presently none of the VAX 6200 XMI
modules utilizes the ECL voltages, but ECL is 
included for potential future use. 




Figure 6 VAX 6240 System, Front Door
Removed 

The VAX 6200 systems all contain two 6-slot 
VAXBI backplanes, which are configured as
independent channels. The first slot of each VAXBI
backplane is occupied by the DWMBA/B module,
leaving 5 slots for standard VAXBI interfaces. All
systems contain a DEBNK TK50 tape controller
and a DEBNA Ethernet controller as standard
equipment. The two VAXBI backplanes are
supplied with +5 V, ±12 V, -5.2 V, and -2 V.

Summary 

The VAX 6200 family of systems merges the 
CMOS VLSI VAX chip, which is used in a number 
of Digital's products, with a very high performance
memory and I/O subsystem. This hardware,
combined with the new fully symmetric
multiprocessing capabilities of VMS version 5.0,
allows very high system throughput previously
achievable only with ECL technology. Moreover,
the extensive use of CMOS technology results in
physically smaller systems. These smaller systems
consume less power and are more reliable
due to the lower component count and lower
power consumption.

The Architectural Definition Process
of the VAX 6200 Family 

The architectural definition of Digital's VAX 6200 family was governed by a 
twofold goal: to build a system with higher throughput than previous 
CMOS, Q-bus-based systems at a cost lower than ECL-based systems. Deci- 
sions made during the definition process were influenced by firm schedule 
guidelines. Further, the very nature of the multiple processor system
imposed its own requirements, particularly in the definition of the XMI
bus. This new 64-bit-wide interconnect is specifically designed to meet the
memory and I/O needs of the symmetric multiprocessor system. Throughout
the architectural definition process, engineers continually evaluated
the interdependency of one design decision upon another and against the
project and schedule goals. By this process, the total definition of the
system (the XMI bus, the processor module, memory module, console
subsystem, and packaging) was achieved.



Definition of the VAX 6200 family of systems 
began in March 1985. The engineers' intent was 
to design a follow-on product to the VAX 8200/ 
8300 family of systems, still in development at 
that time. This paper discusses the system archi- 
tectural definition process that took place during 
1985.

Like the VAX 8200/8300 family before it, the 
VAX 6200 family provides a system environment 
for a VLSI VAX chip set. This new family of sys- 
tems is a mid-range VAX implementation. In this 
context, a mid-range system is defined as a pro- 
duct with more capability than the Q-bus-based 
systems and less capability than the emitter-cou- 
pled logic (ECL) based systems. 

Project Goals 

The primary goal of the VAX 6200 program was 
twofold: to build a system with greater system 
throughput than the CMOS, Q-bus-based VAX
systems, and to ensure system cost was lower than
that of high-performance ECL-based systems. 
Designers would achieve this goal by designing a 
system architecture that allows a moderate num- 
ber of low cost CMOS VAX microprocessors to 
share a common system environment. Such an 
efficient multiprocessor system environment 
would offer higher throughput for a large num- 
ber of applications and at a cost lower than a 
high-performance single processor.



Once the decision to build a multiprocessor 
was made, the next question was how many 
processors to include. Several small computer 
manufacturers were building 8- to 32-processor 
systems at the time. Our belief was that the
market for systems with numerous processors was
fairly small because few applications would run 
efficiently on these systems. Therefore, we 
decided to design the VAX 6200 as a 4-processor 
system, with the possibility of expansion to 
8 processors. This arrangement would allow us to 
still configure cost-effective 1- to 2-processor sys- 
tems. If we found a significant number of applica- 
tions could benefit from the larger number of 
processors, we could expand to 8 processors. 

Building an efficient multiprocessor system 
would necessitate optimization of both hardware 
and software functionality. The VMS asymmetric 
multiprocessing code (VMS versions 2 through 
4) that supported the VAX-11/782, VAX 8300,
and VAX 8800 systems worked well for compute- 
bound, dual-processor systems. However, asym- 
metric operating system software would not be 
acceptable for larger scale multiprocessors. In 
the existing VMS asymmetric multiprocessing 
design, most operating system code was exe- 
cuted on the processor designated as the "pri- 
mary" processor. Whenever a process needed to 
perform I/O or invoke most of the VMS system 
services, the process would have to be scheduled
on the primary processor. The task of making
VMS more symmetric in its handling of I/O and
VMS system services was undertaken to support
the VAX 8840 and the VAX 6200 families.¹

Discussion of how we chose to optimize the 
VAX 6200 hardware begins in the section The
System Interconnect.

Schedule 

In March 1985 the design of the CVAX chips was
already well under way. These chips would be 
delivered in time to allow systems to ship in late 
1987. Based on the CVAX chip set schedule, we 
established the following schedule for the
development of the VAX 6200 system:

■ Six months of architectural definition

■ Twelve months of design/simulation

■ Three months to build and test approximately
five first-pass prototypes

■ Six months to build approximately 70
second-pass prototypes

■ Three months for final testing and
manufacturing introduction

This two and a half year schedule significantly 
influenced the definition of the system architec- 
ture as well as the selection of implementation 
technologies. (Actual implementation took three 
years. The design/simulation phase took three
months longer than expected, and the first-pass
prototype phase took three months longer than
expected.)

The System Interconnect 

The first order of business was to define a new sys- 
tem interconnect. This interconnect would have 
the bandwidth required to support the memory 
and I/O needs of the multiple processors. We
outlined three requirements that would affect the
design of the new system interconnect:

■ We estimated that each CVAX processor would 
require between 3 megabytes (MB) and 6MB 
per second of data to/from memory. This rate
would depend on the clock rate of the
processor, the selected cache architecture, and
the cache "hit" rate of the program being
executed.

■ We also estimated that each processor could 
require peaks of 1MB to 1.5MB per second of
I/O bandwidth.



■ To maintain predictable memory access time, 
we decided that the system bus should not be 
run over 75 percent utilized.

Using the worst-case anticipated bandwidth 
needs, 80MB per second of peak bus bandwidth 
would be required to support 8 processors. 
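
The arithmetic behind the 80MB figure is not spelled out above, but the three requirements appear to combine as in the following sketch. The variable names are ours; the values are the worst-case estimates just listed.

```python
PROCESSORS = 8
MEM_MB_PER_S = 6.0       # worst-case memory traffic per processor
IO_MB_PER_S = 1.5        # worst-case peak I/O traffic per processor
MAX_UTILIZATION = 0.75   # the bus should not run over 75 percent utilized

# Aggregate worst-case demand from all processors.
demand = PROCESSORS * (MEM_MB_PER_S + IO_MB_PER_S)

# Peak bus bandwidth needed so that this demand is 75 percent of capacity.
required_peak = demand / MAX_UTILIZATION

print(demand, required_peak)
```

Under these assumptions the aggregate demand is 60MB per second, and dividing by the 75 percent utilization ceiling gives the 80MB per second peak requirement.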

Because of the tight schedule and our aware- 
ness of the significant amount of time needed to 
design a new system bus, we first looked into the 
feasibility of using an existing bus. We consid- 
ered but rejected the existing VAXBI bus, the 
primary interconnect for the VAX 8200/8300 
system, because of its limited 13.3MB per second
bandwidth. We also rejected the NMI bus, 
the VAX 8500/8700/8800 family interconnect, 
because this bus uses ECL technology. At one 
point we even considered using the SBI from the 
VAX-11/780 system with a 64-bit data path
instead of its existing 32-bit data path. After
extensive analysis, however, we decided a new 
system bus would have to be engineered for the 
product to meet its goals.

Although we would have to define a new bus 
for processor-to-memory communications, the
schedule did not allow us to design a full com- 
plement of I/O interfaces for the new bus. Since a 
large number of I/O interfaces would be avail- 
able on the VAXBI, the design team decided to 
use the VAXBI as the interconnect to all I/O 
devices. The new system interconnect, the XMI, 
would be used only to connect processors, mem- 
ories, and VAXBI channel adapters. Therefore, 
in addition to the requirements listed above, 
the XMI architecture would also allow multiple 
VAXBI channel adapters to optimize I/O through- 
put where necessary for large systems. Use of the 
VAXBI for I/O adapters also had the positive 
effect of minimizing the number of electrical 
interconnects to the XMI; the physical length of 
the XMI would consequently be shorter and the 
total capacitance lower. Further discussion of the 
channel adapters is presented in the section 
VAXBI Channel Adapters. 

In June 1985 a team of 11 senior-level
engineers was assembled to produce the architectural
and electrical specification for the XMI bus and
the VAX 6200 system. In addition to architectural
and electrical experts, this team included one
representative from each of the anticipated
module design teams. Almost all members had
previously worked on projects involving the VAXBI
bus. It was understood that the XMI would be
used solely for the VAX 6200 family of systems,
unlike the VAXBI, which would be used across
many different applications. A strict adherence to
this premise greatly helped the specification 
team to put technical trade-offs in perspective. 

XMI Electrical Interface Definition
Since most of the VAX 6200 system is CMOS and 
transistor-transistor logic (TTL) based, we
immediately decided the XMI could not be
implemented in ECL. To maintain a TTL-level bus and
to achieve the desired bandwidth, the data path
clearly would have to be 64 bits wide. Further, to
meet our goal of 80MB per second bandwidth,
the XMI would have to transfer 64 bits of
information every 80 nanoseconds (ns). (This transfer
rate assumes a protocol in which address and data 
are multiplexed, and up to 32 bytes of data can 
be transferred per address cycle.) 
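
The relationship between cycle time and bandwidth can be checked with a short calculation. The sketch below assumes that a 32-byte transfer occupies exactly one address cycle followed by four 64-bit data cycles, and it ignores arbitration and any other overhead; the function name and parameters are illustrative only.

```python
def peak_bandwidth_mb_per_s(cycle_ns, data_bytes=32, bus_bytes=8,
                            address_cycles=1):
    """Peak data rate in MB per second (1MB taken as 10**6 bytes)."""
    data_cycles = data_bytes // bus_bytes            # 4 cycles for 32 bytes
    total_ns = (address_cycles + data_cycles) * cycle_ns
    return data_bytes * 1000 / total_ns              # bytes/ns -> MB/s


print(peak_bandwidth_mb_per_s(80))
print(peak_bandwidth_mb_per_s(64))
```

With an 80-ns cycle, five cycles carry 32 bytes in 400 ns, which matches the 80MB per second goal. The same model gives 100MB per second for the 64-ns cycle time discussed later in this section.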

Several electrical alternatives were considered 
for the XMI. A scheme using the commercially 
available FutureBus components was seriously 
considered. However, we rejected this scheme 
because a large number of components would be 
necessary to implement the 64-bit data path.

The lack of commercially available
components to drive a 64-bit bus at the required speed
finally led us to a decision. We would design a
bit-sliced custom CMOS bus interface chip set.
Each chip would transceive 11 lines, and seven
chips would be used for the entire data path. 
Although the "sliced" bus interface would use 
more module real estate than a larger chip, the 
sliced bus design greatly simplified the chip 
packaging problems. Each chip would fit into a
standard 44-pin cerquad package. A sliced XMI 
interface also allows each chip to dissipate under 
0.5 watt (W), which enhances reliability and 
relieves the need for heat sinks on the part. With- 
out heat sinks, the XMI interface parts can be 
mounted on both sides of each module. This 
arrangement saves 50 percent of the real estate 
necessary to interface to the XMI.

To simplify the design of the full custom XMI 
interface parts, we would keep the functional 
requirements for the parts as simple as possible. 
The XMI interface chips have little knowledge of 
the XMI protocol and serve only as the electrical 
interface. Due to the divergent needs of
processor, memory, and I/O interfaces, designers
already knew that each module would need a 
different VLSI chip for XMI interface functions. 
We decided, therefore, that each module VLSI
chip would be required to supply the logic to
implement the bus-level protocol. 

As the electrical design of the XMI progressed, 
a bus cycle as fast as 64 ns appeared feasible.
Although not entirely necessary to support the 
stated system performance goals, the faster XMI 
cycle time was strongly pursued to gain extra 
margin in the system design. Furthermore, this 
fast cycle time would allow the possibility of sys- 
tem upgrades to faster processors in the future. 
Consequently, 64 ns became the stated goal for 
the XMI cycle time; 80 ns was the fallback
strategy if the design complexity of a 64-ns cycle
time began to place the overall project schedule 
at risk. 

Logic design across the entire system was done 
assuming a 64-ns cycle time. Eventually 64 ns 
became the actual speed of the bus as the CMOS 
process was characterized and the first parts were 
sampled and found to contain sufficient margin to 
support the faster cycle time. 

XMI Protocol Definition 
XMI protocol definition took place in parallel 
with the electrical definition of the bus. It was 
clear from the start that the bus would cycle sev- 
eral times faster than the memory subsystem. This 
difference in cycle times immediately led us to 
the decision that the XMI would run a "pended" 
bus protocol. A pended bus protocol allows
control of the XMI to be relinquished between
a "read" command and the return of the data from 
the memory subsystem. With multiple processors 
and multiple memory controllers, several read 
commands could be outstanding at a time. 
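
The benefit of a pended protocol can be illustrated with a simple cycle count. The sketch below is not taken from the XMI specification: the memory latency of six bus cycles is a hypothetical figure chosen only to represent a memory several times slower than the bus, and a 32-byte read is assumed to occupy one command cycle plus four 64-bit data cycles.

```python
def bus_cycles_per_read(memory_latency_cycles, data_cycles=4, pended=True):
    """Bus cycles consumed by one read under a simplified model."""
    command_cycles = 1
    # A non-pended bus is held idle while the memory fetches the data;
    # a pended bus is released and can carry other traffic meanwhile.
    held_idle = 0 if pended else memory_latency_cycles
    return command_cycles + held_idle + data_cycles


HYPOTHETICAL_LATENCY = 6  # bus cycles; illustrative value only
print(bus_cycles_per_read(HYPOTHETICAL_LATENCY, pended=False))
print(bus_cycles_per_read(HYPOTHETICAL_LATENCY, pended=True))
```

In this model every cycle of memory latency is returned to the bus for use by other processors, which is why several reads can be outstanding at once.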

To optimize data traffic on the XMI bus, 
we needed to define data transfer commands 
of several lengths. Since VAX instructions may 
write as little as 1 byte of data, a 64-bit write
command was defined. (There is a mask field 
associated with the write command that allows 
single bytes to be written.) Since the VAXBI 
bus already had commands to transfer 16 bytes 
of data per address, it was essential to allow 
similar commands on the XMI bus to mini- 
mize the interface complexities to the VAXBI. 
Eventually we added a 32-byte read command 
to allow processors to prefetch larger amounts 
of data upon cache misses. A 32-byte write 
command was not implemented, because it 
would be too great a burden for the memory 
controller to buffer multiple 32-byte write 
commands.
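
The effect of the mask field on a 64-bit write can be sketched as follows. The encoding used here, one mask bit per byte with bit 0 enabling the lowest-addressed byte, is an assumption made for illustration; the actual XMI mask encoding is not described in this paper.

```python
def masked_quadword_write(old, new, mask):
    """Merge `new` into `old` (8-byte values); bit i of `mask` enables byte i."""
    if len(old) != 8 or len(new) != 8:
        raise ValueError("both quadwords must be exactly 8 bytes")
    return bytes(n if (mask >> i) & 1 else o
                 for i, (o, n) in enumerate(zip(old, new)))


# Write a single byte (byte 2) of an otherwise untouched quadword.
memory = bytes(8)
updated = masked_quadword_write(memory, b"\xAA" * 8, 0b00000100)
```

Only the byte selected by the mask changes, which is how a 1-byte VAX write can be carried by a 64-bit bus command.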

In many cases the protocol of the XMI is
similar to that of the VAXBI. In part this similarity
resulted because the designers of the XMI were
very familiar with the VAXBI. The similarity
between protocols was also deliberately chosen
because it greatly reduced the complexity of
interfacing the XMI to the VAXBI for I/O
purposes.

The bus arbitration scheme is one area where 
designers had to deviate from the method used 
by the VAXBI bus. The VAXBI uses the main 
bus data path for arbitration, which requires 
extra bus cycles. This approach was not feasible 
for the XMI pended protocol, since two
arbitrations are necessary for each read
transaction. Further, the VAXBI arbitration scheme also
requires a great deal of duplicated logic in every
module. Due to the large number of allowable 
XMI nodes, it was not feasible to implement 
an arbitration mechanism located on an XMI 
module. To implement arbitration on any XMI 
module would have required a great number of 
signal pins. The solution was to implement a
centralized arbiter. The XMI uses a module
physically attached to the rear of the backplane as a
centralized arbiter as well as the source of the
master clock.

The subject of data integrity on the XMI was of 
great concern to the designers. Initially, carrying
error-checking and correction (ECC) bits on the
bus was considered. However, this scheme was
rejected because additional encode/decode
timing would have been required, and because
additional bits would have to be carried on the bus.
Eventually a robust protocol was implemented
based on parity detection and hardware/software
retries when errors are detected. All transient
single-bit errors on the XMI are recoverable.

XMI Physical Definition 
The physical definition of the XMI was a difficult 
task. There were a great number of interdepen- 
dent trade-offs for module size, module spacing, 
number of backplane slots, and cabinet size.

To minimize design complexity, we had
decided at a very early stage that each module 
within the VAX 6200 system would implement a 
single function. Thus the task of designing each 
module was simplified and the diagnosability of 
the system enhanced. Initially, the size of the 
XMI module was largely governed by the space 
needs of the processor. Analysis showed that a
processor based on the CVAX chip set could fit on
a module the same size as the existing VAXBI
module: 20.32 cm (8.0 inches) by 23.33 cm
(9.187 inches). In addition, 32MB of memory
could fit on the same size module. 

System packaging was another factor to con- 
sider in selecting the module size. From the very 
start of the VAX 6200 program, it was not clear 
what type of system-level packaging was optimal.
Designers knew, however, that the larger systems 
would primarily be placed in computer-room 
environments. For these applications, a standard 
153.67-cm (60.5-inch) tall cabinet would be
necessary. What was not clear was if office-type 
packaging or rack-mount-type packaging would 
be required. Since VAXBI form-factor pedestal and
rack-mount box packages were both available,
designers found it very attractive to use the same
form-factor module for the XMI to ease the
development of these packages if necessary. Based on
the functionality fit and the desire to potentially 
reuse existing packaging, we decided to adopt 
the VAXBI module size for the XMI. 

Another advantage to using the VAXBI module 
size was the opportunity to use the VAXBI zero 
insertion force (ZIF) backplane connector. His- 
torically, developing new backplane connector 
technologies has proven difficult and time- 
consuming. The VAXBI uses a five-segment, 
60-pin-per-segment connector. Of the 300 pins, 
120 pins are assigned to the VAXBI signals and
180 pins to each module for I/O use. Since the 
XMI has 32 more data-path bits than the VAXBI, 
designers chose to allot an extra 60 pins for the 
XMI signals. This leaves 120 pins for general
module use. Designers believed the arrange- 
ment to be acceptable, since there are no I/O 
modules for the XMI. The only use for the I/O
pins is to connect to the VAXBI card cages. The 
120 available pins are more than adequate for 
this function. 

To meet the cycle time goals for the XMI bus, 
the length of the XMI would have to be limited to 
about 0.3 meters (12 inches) and the number of 
loads limited to approximately 16. The XMI
protocol assumed a maximum of 16 devices would
interface to the XMI bus. Eventually the number
of slots in an XMI backplane became 14 for two
different reasons. First, 14 slots would allow a
system to have 8 processors, four memory arrays,
and two VAXBI channels. Second, a 14-slot XMI
backplane would be very similar in size to the 
pair of 6-slot VAXBIs that already existed in the 
VAXBI pedestal and rack-mount box packages. 

XMI module spacing of 2.03 cm (0.8 inch)
is the same as that on the VAXBI bus. We chose
this spacing to allow for heat-sink components
on side 1 of the module. Enough height would
remain to allow non-heat-sink, surface-mounted
components on side 2.

About 18 months into the program, the module 
designs were complete, and both the pro- 
cessor module and memory module were experi- 
encing great difficulty during printed circuit 
board layout. Although all components could 
be placed within the area available, the very 
high pin-count gate arrays in use (223 pins) 
were causing considerable routing problems. 
To lower the schedule risk to the program, 
designers decided to lengthen the module by 
7.62 cm (3 inches). The impact to the computer- 
room packaging was minimal because a 76-cm 
(30-inch) cabinet depth could accommodate 
the change. However, the change in module 
length made impossible the adaptation of the 
existing VAXBI pedestal and rack-mount
packaging to the XMI. At this time the pedestal-based
strategy for the MicroVAX 3500/3600 systems 
was clear, thus reducing the need to package 
the VAX 6200 family of systems for office use. 
Further, extremely low sales of rack-mounted 
VAX 8200/8300 systems led us to the decision 
that a rack-mount package was not immediately 
necessary.

XMI Interface Technology 
The decision to implement the XMI electrical 
interface in simple full custom CMOS parts dic- 
tated that each module have additional logic to 
complete the XMI interface and to supply
module-specific logic. To simplify both the design of
the XMI interface parts and the CAD tools, we 
decided that all module-to-XMI interfaces would 
be implemented in the same technology. Given 
the aggressive design schedule, we would need a 
technology that was mature as well as easy to 
design for.

We initially focused on a family of 2-micron 
CMOS gate arrays available from LSI Logic and 
Toshiba. However, it quickly became clear that 
the array limitation of approximately 10,000 gates
would force us to place multiple chips on each 
module. The use of multiple chips was highly 
undesirable from the perspective of design 
resources, module real estate, and cost. A search 
was started to locate a suitable alternative. To get 
the desired logic density, several semicustom
alternatives were explored but ultimately
rejected because of the immaturity of their CAD 
tools. 

Discussions with LSI Logic Corporation led us 
to consider their newly developed 1.5-micron
"Sea of Gates" array, which offers up to 50,000
routable gates. Although this array did not give us 
the mature technology we were seeking, it did 
appear to offer the flexibility needed by all XMI 
designs. We ultimately chose the LSI Logic 
LL10000 family of gate arrays because all designs
could use the same technology. Moreover, we 
could focus our CAD tool development on a 
single technology.

The 64-bit-wide XMI data path forced the pin 
count of a single interface chip to be 200-plus 
pins. The LSI Logic LL10000 array was offered 
in a 223-pin pin-grid-array (PGA) package which 
appeared suitable. Although most of the logic 
on each module was implemented in surface- 
mounted components, we did not pursue a 
223-pin, surface-mount package. We wanted to
avoid the manufacturing problems presented by 
components with 25-mil pitch leads. 

The Processor Module 

The VAX 6200 processor module uses the CVAX 
chip set to implement the VAX instruction set. 
Due to an uncertainty about the final CVAX chip 
speed, the CPU module was designed to operate 
over a range of 70 ns to 100 ns. The intent was to
use "binned" parts in the VAX 6200 system, and 
to use the "nominal" parts in the MicroVAX 
3500/3600 systems. (Chip manufacturing pro- 
cesses yield parts of different speeds; "binning" 
refers to the process of testing the chips over a 
range of speeds.) For the CVAX chip set, the
nominal parts run at 90 ns, and the binned parts run
at 80 ns. 

A major system-wide architectural issue, which 
primarily affected the processor module, was 
whether the cache should be write-back or write- 
through. Although a write-back cache could 
potentially reduce the number of processor 
writes on the XMI by 50 percent, such a cache 
was complicated and had never before been 
designed for a multiprocessor VAX system. Our 
final decision was based on the need to reduce 
overall risk to the program. Therefore, we would 
implement the more straightforward write- 
through cache design and build the extra band- 
width into the XMI to handle the additional write 
traffic. 

Once the decision to implement a
write-through cache was made, the major architectural
issue for the processor module became the cache
organization. The CVAX chip contains an internal
1-kilobyte (KB), two-way set associative cache
accessible to the internal microengine in one
cycle. Due to the long latency to main memory, a
second-level cache on the processor module was
imperative. The size of the second-level cache
was determined by the available static random- 
access memory (SRAM) technology. The newly 
available high-speed 64K-by-4 SRAMs would 
provide a 256KB cache with only eight parts. 
Although no accurate simulation was available to 
indicate the effect of this large cache, the effects 
were assumed to be positive. Therefore we 
decided that the higher cost of the SRAMs was a 
worthwhile trade-off given the potential gains in 
system performance.
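
The 256KB figure follows directly from the SRAM organization, as the short calculation below shows. It counts only the eight data parts named above; any additional parts needed for tags or parity are not described in this paper and are not modeled.

```python
WORDS_PER_PART = 64 * 1024   # a 64K-by-4 SRAM holds 64K words
BITS_PER_WORD = 4            # each word is 4 bits wide
PARTS = 8

bytes_per_part = WORDS_PER_PART * BITS_PER_WORD // 8
cache_kbytes = PARTS * bytes_per_part // 1024
data_width_bits = PARTS * BITS_PER_WORD

print(cache_kbytes, data_width_bits)
```

Eight 4-bit-wide parts also form a 32-bit data path, the same width as the CDAL data path mentioned later in this section.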

A third major issue relative to the caches on the 
processor module was the invalidation scheme. 
In the past, VAX processors have managed cache
invalidation, since processors and I/O devices 
have always shared a common memory subsys- 
tem. The issue of cache invalidation became 
much more important to our program because 
of the multiprocessor nature of the VAX 6200 
system. This type of system could cause large
amounts of stale data as a process migrates from
processor to processor. 

The 1KB cache contained within the CVAX 
chip caused the largest problem. If it were 
allowed to cache data that could become stale, 
every write to memory would potentially have 
to be invalidated within the CVAX cache. 
This meant choosing one of two approaches: 
(1) broadcasting every write in the system onto
the CDAL bus of every processor, or (2) finding a
way to maintain a duplicate tag store of tags
within the CVAX chip and only passing writes
known to reference cached data within the CVAX
onto the CDAL. Another alternative was to cache
only instruction-stream (I-stream) data within
the internal cache. This alleviates the need to 
invalidate, because I-stream data is defined to be 
read-only by the VAX architecture. We projected 
this alternative could cause a 3 to 5 percent 
degradation in CPU performance.

Analysis of the cache-invalidate problem 
proved very difficult, because we did not know 
what percentage of data would be shared in this 
class of multiprocessor system. With the
potential for 8 processors, it was clear that all writes
could not be broadcast into each of the CVAX
chips. The possibility of maintaining a duplicate
external tag store proved to be very difficult to 
implement. Consequently, we chose the alterna- 
tive to store only I-stream data within the internal 
CVAX cache. 

A similar problem was knowing when to
invalidate data in the external cache. In this case it was
feasible to implement a duplicate tag store. The 
second-level cache has two tag stores. One is 
located on the CDAL and is used for cache look- 
up by the CVAX chip. The second tag store is 
located within the XMI interface and is used to 
determine if XMI writes hit the second-level 
cache. When hits are detected, a request is
queued to invalidate the entry within the second- 
level cache. 
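
The two-tag-store arrangement can be modeled in software as a rough sketch. Several details below are assumptions made for illustration and are not stated in this paper: the cache is treated as direct-mapped, the block size is taken to be 32 bytes, and the class and method names are ours.

```python
from collections import deque

BLOCK_BYTES = 32                          # assumed block size
NUM_BLOCKS = 256 * 1024 // BLOCK_BYTES    # 256KB second-level cache


class SecondLevelCache:
    def __init__(self):
        self.cdal_tags = [None] * NUM_BLOCKS  # used for CVAX look-ups
        self.xmi_tags = [None] * NUM_BLOCKS   # duplicate, checked on XMI writes
        self.invalidate_queue = deque()

    @staticmethod
    def _split(address):
        block = address // BLOCK_BYTES
        return block % NUM_BLOCKS, block // NUM_BLOCKS   # (index, tag)

    def fill(self, address):
        index, tag = self._split(address)
        self.cdal_tags[index] = tag
        self.xmi_tags[index] = tag

    def lookup(self, address):
        index, tag = self._split(address)
        return self.cdal_tags[index] == tag

    def observe_xmi_write(self, address):
        """A write seen on the XMI; queue an invalidate if it hits."""
        index, tag = self._split(address)
        if self.xmi_tags[index] == tag:
            self.xmi_tags[index] = None
            self.invalidate_queue.append((index, tag))

    def drain_invalidates(self):
        """Apply queued invalidates to the tag store on the CDAL side."""
        while self.invalidate_queue:
            index, tag = self.invalidate_queue.popleft()
            if self.cdal_tags[index] == tag:
                self.cdal_tags[index] = None
```

The point of the duplicate store is that XMI writes that miss never disturb the tag store used by the processor; only queued hits cost it a cycle.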

Another problem to be solved on the processor 
module was the issue of combining writes into 
larger blocks before issuing them to the XMI.
Since the CDAL data path is only 32 bits wide, the
CVAX chip is incapable of generating a write
command any larger than 32 bits. The 64-bit data
path of the XMI would need larger writes to oper- 
ate efficiently. The solution to this problem was 
to implement a "write buffer" in the XMI inter- 
face of the processor module. The write buffer 
takes advantage of the fact that writes generated 
by VAX processors are often sequential. The
write buffer will buffer up to four sequential 
32-bit writes and combine them into a single 
XMI write transaction.²
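
A minimal sketch of the combining behavior follows. The flush policy shown (flush when four longwords have been collected or when a non-sequential write arrives) is an assumption; the paper does not describe when the real write buffer empties, nor any alignment rules it applies.

```python
class WriteBuffer:
    MAX_LONGWORDS = 4   # up to four sequential 32-bit writes

    def __init__(self):
        self.base = None    # address of the first buffered longword
        self.data = []      # buffered 32-bit values
        self.issued = []    # (address, [longwords]) XMI write transactions

    def write(self, address, value):
        sequential = (self.base is not None
                      and address == self.base + 4 * len(self.data))
        if self.data and not sequential:
            self.flush()            # a non-sequential write ends the run
        if not self.data:
            self.base = address
        self.data.append(value)
        if len(self.data) == self.MAX_LONGWORDS:
            self.flush()            # buffer full: issue one combined write

    def flush(self):
        if self.data:
            self.issued.append((self.base, list(self.data)))
            self.base, self.data = None, []


buf = WriteBuffer()
for offset, value in enumerate([1, 2, 3, 4]):
    buf.write(0x100 + 4 * offset, value)
```

Here four 32-bit processor writes leave the buffer as a single XMI transaction rather than four.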

The Memory Module 

The system design goal was to provide the
capacity for 15MB to 30MB of memory per processor.
As mentioned earlier, the module size was par- 
tially governed by the need for 32MB of memory 
per memory module. The number of slots in the 
XMI backplane was also partially determined by 
the desired amount of system memory.

The wide range of possible VAX 6200 con- 
figurations dictated the need for an expandable 
memory subsystem. Since full memory
bandwidth would only be necessary for very large
configurations, it was decided to adopt a
distributed memory architecture. An individual
memory controller could be made simpler if it 
did not have to supply full XMI bandwidth. Full 
XMI bandwidth could be achieved by inter- 
leaving multiple memory controllers. 

With the module size and number of slots 
determined, the first architectural decision to be 
made for the memory was the internal organization of
the memory. The 64-bit width of the XMI made it
desirable to have a 64-bit data path internal to 
the memory module. The very tight module real 
estate made it very attractive to consider imple- 
menting a 64-bit data path to reduce the number 
of required ECC check bits. A 64-bit-wide data 
path was also attractive given that the processor 
module would issue a read for 32 bytes whenever 
there was a cache miss.

The negative side of a 64-bit internal memory
organization was that any write of less than
64 bits in width would result in a read-modify-write
operation to calculate the proper ECC
code. An analysis of the expected write traffic
through the processor's write buffer showed that
approximately 50 percent of all writes would be
a full 64 bits in width. Further analysis showed
that as long as there was at least one memory
controller for every 2 processors, there would
be sufficient memory bandwidth for the system.
Given the performance characteristics of the
CVAX processor, it seemed reasonable to require
a 32MB memory array for every 2 processors. We
therefore decided to implement the 64-bit
internal memory organization.

Since it was very difficult to design the memory 
module to accommodate the full bandwidth of 
the XMI, designers used memory interleaving to
provide an aggregate memory bandwidth
compatible with the speed of the XMI bus. The
interleave size of 32 bytes was determined by
the protocol of the XMI, which allows reads of
32 bytes per address cycle.
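
Interleaving on 32-byte boundaries can be pictured with the mapping below. A simple modulo assignment is assumed for illustration; the paper does not describe how the hardware actually assigns address ranges to memory controllers.

```python
INTERLEAVE_BYTES = 32


def controller_for(address, num_controllers):
    """Index of the memory controller that services `address`."""
    return (address // INTERLEAVE_BYTES) % num_controllers


# Consecutive 32-byte blocks rotate across four controllers.
sequence = [controller_for(a, 4) for a in range(0, 160, 32)]
```

Because each 32-byte read falls entirely within one controller, successive reads to sequential addresses are spread across the controllers and can overlap in time.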

The multiprocessing design of the system 
made it possible for a single memory controller 
to be the object of several simultaneous requests. 
To avoid rejection of processor traffic, we 
designed the memory controller with an input 
queue. This queue accepts memory access 
requests and services them in a first-in, first-out 
(FIFO) order.

Initially the memory controller was designed 
with a four-command queue that would reject 
new requests once the queue was full. As the 
design progressed, we realized that with our XMI 
arbitration scheme, a processor or VAXBI channel 
adapter might possibly be denied memory access. 
A processor or channel adapter might be denied 
access for indeterminate periods of time if the 
memory array was allowed to reject commands 
when its queue became full. To avoid this
problem, the memory array was allowed to assert a
signal on the XMI that would inhibit all new
commands from being issued on the XMI. Unfor- 
tunately, due to the pipelined nature of the pro- 
cessor and the memory array, three additional 
commands could possibly be received by the 
memory controller after it had determined the 
need to stop additional requests Since the depth 
of the command queue was four, the memory 
array would need to "stall" the bus after receiv- 
ing only a single command. Since this effectively 
eliminated the command queue, we decided to 
lengthen the depth of the command queue to 
eight entries 
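
The queue-depth argument above can be sketched as a small model. It assumes, as the text states, that up to three commands may still arrive after the controller decides to stall; the class and method names are hypothetical.

```python
from collections import deque


class CommandQueue:
    """FIFO command queue that signals a stall early enough to avoid overrun."""

    def __init__(self, depth: int, in_flight: int = 3):
        self.depth = depth          # total queue entries
        self.in_flight = in_flight  # commands that can arrive after a stall
        self.entries = deque()

    def stall_threshold(self) -> int:
        # Occupancy at which the stall signal must be asserted.
        return self.depth - self.in_flight

    def accept(self, command) -> bool:
        """Queue a command; return True if the stall signal must be asserted."""
        if len(self.entries) >= self.depth:
            raise OverflowError("command queue overrun")
        self.entries.append(command)
        return len(self.entries) >= self.stall_threshold()

    def service(self):
        """Remove and return the oldest command (first-in, first-out)."""
        return self.entries.popleft()
```

With a depth of four the threshold is one, so the bus stalls after a single command; with a depth of eight the controller can hold five commands before stalling.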

The VAX architecture forces the use of a hard- 
ware-based memory lock to control access to 
shared data structures. The memory lock is used 
by some intelligent I/O adapters as well as 
processors.

System performance suffers when there is conflict over
different lock variables that acquire a common hardware lock.
Given that Digital had never built a fully symmetric
multiprocessor system and that major changes were being made
to VMS, we did not know what the lock traffic pattern would
look like in a large system. We did know, however, that the
existing VAXBI I/O adapters and the CVAX processor could not
hold more than a single hardware lock at one time. Based on
this, we designed the memory controller to have up to 16
locked locations. This number seemed more than adequate given
a maximum of 8 processors and only three existing VAXBI I/O
adapters that use memory locks (Ethernet, CI, and TK50). The
granularity of each lock is 32 bytes to simplify the memory
controller's handling of 32-byte read requests. Lock
congestion is still possible if multiple lock variables are
allocated within the same 32-byte region of memory. An
examination of VMS code shows lock congestion to be very rare.
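
A short sketch makes the lock-granularity point concrete. It assumes only the 32-byte lock granularity given in the text; the helper names and the example addresses are hypothetical.

```python
LOCK_GRANULARITY_BYTES = 32  # lock granularity stated in the text


def lock_region(address: int) -> int:
    """Index of the 32-byte region that a lock variable falls in."""
    return address // LOCK_GRANULARITY_BYTES


def share_hardware_lock(addr_a: int, addr_b: int) -> bool:
    """True if two lock variables would contend for the same hardware lock."""
    return lock_region(addr_a) == lock_region(addr_b)
```

Two lock variables at 0x1000 and 0x101C fall in the same region and can congest; moving the second to 0x1020 separates them.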

VAXBI Channel Adapter 
We had decided right from the start to design an XMI-to-VAXBI
channel adapter to handle all I/O. To meet the desired maximum
I/O rates of 1MB to 1.5MB per second for each processor, we
would include multiple XMI-to-VAXBI adapters. Although two
VAXBI channels would allow 1.5MB per second per processor, it
was decided to allow up to eight VAXBI channels to be
connected to the VAX 6200 system. The design was not made more
complex by the change from two to eight VAXBI channel
adapters.

Designers wanted to optimize data transfer from the VAXBI into
XMI memory, since statistically more data is read from I/O
devices than is written. A double-buffered direct memory
access (DMA) data path from the VAXBI to the XMI allows
transfers at the full 13.3MB per second VAXBI data rate.

For reads of XMI memory, it was known that 
full bandwidth could not be maintained due to 
the memory read latency through the VAXBI 
channel adapter and the XMI memory subsystem. 
Since most I/O transfers are sequential, we con- 
sidered building prefetch buffers into the VAXBI 
channel adapter. Because transfers could be in 
progress to several VAXBI nodes at once, multiple 
prefetch buffers would be needed. Since prefetch 
buffers can architecturally be considered to be 
small caches, the VAXBI channel adapter would 
also have to monitor all XMI traffic for potential 
invalidate conditions. Eventually the need for large amounts
of buffer storage and the complication of XMI monitoring led
us to decide against building prefetch buffers. This decision
was influenced by other factors as well. No single existing
VAXBI I/O adapter could read at full bandwidth, and multiple
I/O devices could be spread across several VAXBI channels to
achieve a higher aggregate XMI read bandwidth.

To ease physical implementation, the VAXBI 
channel adapter was implemented on two mod- 
ules that were interconnected by four 60-pin 
cables between their I/O pins. Unlike the 
VAX 8500/8700/8800 VAXBI channel adapter, 
the VAX 6200 could not use a single XMI module 
to connect to multiple VAXBI buses. The 6200 is
restricted by the 120 I/O pins available on an 
XMI module. 

Console Subsystem 

The console function in a VAX 6200 system is performed by code
run on the CVAX CPUs. This use of the main CPU-based console
contrasts with the more traditional use of a dedicated
front-end processor, which has access to all system resources.
We chose to use a main CPU-based console primarily because we
had no way to externally access the internal state of the CVAX
processor. Furthermore, we did not want to add the cost of a
dedicated console processor.

A side benefit of a system design that employs multiple
processors, memories, and I/O adapters is the opportunity to
design in extra availability by reconfiguring the system in
the event of a single component failure. To accommodate
reconfiguration, all processors would have to be
allowed access to the physical console terminal 
as well as the physical front control panel of the 
system. This access is accomplished by busing 
the signals that interface to the console terminal 
and front control panel across the XMI. However, 
we needed a mechanism to ensure that only one 
of the processors would actually respond to the 
console terminal and front control panel. This 
mechanism is a protocol whereby the processor 
in the lowest XMI slot that passes self-test 
assumes control of these external resources. The 
processor that takes control of the console termi- 
nal is known as the "primary processor." The pri- 
mary processor communicates with all other pro- 
cessors by means of a message passing protocol 
through system memory. 
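
The selection rule described above (the processor in the lowest XMI slot that passes self-test becomes the primary) can be sketched as follows. The dictionary representation of self-test results and the function name are assumptions made for illustration; the actual mechanism is a hardware and firmware protocol on the XMI.

```python
from typing import Dict, Optional


def select_primary(self_test_passed: Dict[int, bool]) -> Optional[int]:
    """Return the lowest XMI slot whose processor passed self-test.

    self_test_passed maps an XMI slot number to its self-test result.
    Returns None if no processor passed.
    """
    passing = [slot for slot, ok in self_test_passed.items() if ok]
    return min(passing) if passing else None
```

If the processor in slot 1 fails self-test, the processor in the next passing slot takes over the console terminal and front control panel, which is the reconfiguration behavior the text describes.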

It is necessary for the console subsystem to
have access to a mass storage device. Such access
is needed for distribution of software and for
loading of diagnostics. The TK50 was selected 
because of its high density and the availability of 
a preexisting VAXBI interface (the DEBNK). The 
TK70 was not used because there was no VAXBI 
interface to it. The only other alternative was the 
RX50, which has superior access time but a data 
capacity of only 400KB. The longer time to run 
diagnostics from the TK50 was unimportant 
since the system can be diagnosed largely by 
means of diagnostics contained in CPU read-only 
memory (ROM) and by the built-in self-test con- 
tained in all VAXBI I/O adapters. Further, the 
TK50 makes an excellent software distribution 
device and allows VAX 6200 systems to be 
configured without nine-track magtape drives. 

In previous systems dependent on ROM-based 
console programs and ROM-based diagnostics, 
code updates have been a problem. To alleviate 
the need to physically change the ROM, each 
VAX 6200 processor contains a 32KB electroni- 
cally erasable ROM (EEROM). Most console and 
diagnostic code is accessed by means of an 
address table contained in the EEROM. In the 
event that a code bug needs to be corrected, the 
address table is rewritten to point to a replace- 
ment routine that is also written into the EEROM. 
The console program implements a routine that 
can patch the EEROM image from a database 
distributed on TK50 tape.
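
The address-table indirection described above can be modeled in a few lines. This is only a conceptual sketch: the class, the routine names, and the use of Python callables in place of EEROM addresses are all hypothetical and do not reflect the actual console firmware.

```python
class ConsoleImage:
    """Toy model of console code reached through a patchable address table."""

    def __init__(self, routines):
        # name -> routine; stands in for the EEROM address table
        self.address_table = dict(routines)

    def call(self, name, *args):
        # All calls go through the table, never directly to a routine.
        return self.address_table[name](*args)

    def patch(self, name, replacement):
        # Repoint the table entry at a replacement routine.
        self.address_table[name] = replacement


def checksum_v1(data):
    return sum(data)  # stands in for a routine with a bug


def checksum_v2(data):
    return sum(data) & 0xFF  # stands in for the corrected routine
```

Because callers never hold a direct reference to a routine, rewriting one table entry is enough to redirect every use of it.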

Power and Packaging 

As noted earlier, we projected that the VAX 6200 system would
be used as a large system generally located in computer-room
environments. An important goal was to design for a high
degree of flexibility and configurability in the system. The
decision to use a 14-slot XMI backplane had been based on
desired maximum configurations and the size of existing
pedestal and rack-mount packages.

In addition to housing the XMI backplane, the computer-room
package would need to house VAXBI backplanes to accommodate
I/O adapters. The VAXBI backplane is manufactured in
cascadeable 6-slot segments. It seemed that two 6-slot VAXBI
backplanes would provide adequate I/O adapters for most
systems. To ensure that all customers' I/O requirements could
be met, a design was also initiated for a VAXBI expander
cabinet that could house four additional 6-slot VAXBI
backplanes.

To avoid developing a new power subsystem, we looked into
modifying an existing power system. Although we could find no
preexisting perfect match, we did locate a previously designed
5-volt (V) regulator specified at 100-ampere (A) output. We
respecified this design to 120 A by using slightly higher
power components. The VAXBI requires +12 V, -5.2 V, and -2 V
in addition to the main +5 V channel. To accommodate the VAXBI
requirements, a new regulator was designed. The XMI backplane
is supplied with two of the 5 V regulators (one for main logic
and one for memory). Although not required by any current
designs, one of the ±12 V, -5.2 V, and -2 V regulators also
supplies the XMI for potential future designs. The two VAXBI
backplanes are supplied by one +5 V regulator and one of the
±12 V, -5.2 V, -2 V regulators.

Conclusion 

The design of a complex system like the 
VAX 6200 is much more than making well- 
informed engineering decisions based on hard 
data. Engineers based the initial system definition 
on their perceptions of the needs for future com- 
puting systems. The definition was further 
shaped by what was technically feasible with a 
defined degree of risk. Throughout the architec- 
tural specification phase, many trade-offs were 
made with only partial data and the intuitive 
insight of very experienced engineers. 

The design process for the VAX 6200 system 
was extremely smooth, and the product was 
designed within six months of the initial engineering goals.
Due to the large degree of built-in configuration flexibility,
the product definition never changed enough to force a change
in direction during the design phase. Careful balancing of
technical complexity with the necessary minimum functionality
yielded an architecture that could be implemented with a
manageable amount of risk in a bounded amount of time.

Acknowledgments 

The initial VAX 6200 system definition and archi- 
tectural group was made up of the following peo- 
ple: Brian Allison, Charlie Barker, Frank Bomba, 
Darrel Donaldson, Rick Gillett, Dave Hartwell, 
Dave Ives, Jim Stegeman, Pat Sullivan, Mike 
Uhler, and Doug Williams. 

References 

1. R. Gamache and K. Morse, "VMS Symmetric Multiprocessing,"
   Digital Technical Journal (August 1988, this issue): 57-63.

2. R. Gillett, "Interfacing a VAX Microprocessor to a
   High-speed Multiprocessing Bus," Digital Technical Journal
   (August 1988, this issue): 28-46.



Digital Technical Journal 

No. 7 August 1988



27 



Richard B. Gillett, Jr.



Interfacing a VAX Microprocessor
to a High-speed Multiprocessing Bus

The design decisions involved in interfacing a microprocessor (CVAX) 
to a high-speed, shared-memory multiprocessing bus (XMI) are more 
complex than those encountered in designing a single-processor sys- 
tem. Although the same basic interface architectures are used, the signifi- 
cantly different multiprocessing environment requires a much more 
complex implementation. In particular, the performance of a multiproces- 
sor system is very dependent on the efficiency of its main memory inter- 
face. To achieve the desired system performance, appropriate compro- 
mises between design complexity and performance must be made. In the 
case of the VAX 6200 system, performance simulations made early in the
project guided the complexity/performance trade-offs. Actual system
performance results have largely confirmed the validity of the design
trade-offs.



The primary goal of the VAX 6200 design was to provide a
general-purpose, high-performance, mid-range VAX computing
system. Further, this system design would take advantage of
Digital's proprietary CMOS technology and VMS version 5.0
symmetric multiprocessing capabilities. VMS version 5.0 has
dramatically changed the way we approach mid-range system
design; no longer do we design a system to support just one or
two processors. With the ability to effectively utilize the
power of four or more processors within the same system came
the need to design significantly higher performance
interconnects to tie these processors together.

The VAX 6200 was to be Digital's first CMOS 
multiprocessor system. The designers were there- 
fore strongly motivated to provide the best per- 
forming product they could within reasonable 
time and complexity constraints. Complexity was 
of particular concern since the product schedule 
did not allow for the production of second-pass 
parts prior to the first shipment to customers. 
Complex multiprocessor interfaces give ample 
opportunities for the kinds of elusive design bugs 
that can be very difficult and time-consuming 
to exercise and diagnose. In addition, unlike other recent VAX
systems, the VAX 6200 system required a major new release of
VMS. (In many ways the new release represented a new operating
system.) We expected its availability could be the critical
path to product shipment.

The operating system software would probably 
not stabilize in time for us to discover and fix 
any major hardware problems and still stay on 
our original schedule. Unfortunately, until the 
operating system stabilizes, testing for complex 
bugs is difficult. This concern about complexity 
relative to the schedule affected several design 
decisions.

On the VAX 6200 CPU module, the design 
challenge was to interface a custom CMOS VAX 
microprocessor (called CVAX) to a high-speed 
multiprocessor bus (called XMI). The trade-offs 
made during the design of a multiprocessor sys- 
tem are more complex than those made in 
designing a single-processor system. For a single- 
processor system, the performance trade-offs are 
relatively straightforward. The goal is to design 
the highest performance single-processor system 
that is practically possible. For a multiprocessor 
system, the goal of maximum single-processor 
performance must be tempered to obtain maxi- 
mum system throughput (i.e., multiprocessor 
performance). 

The foundation of the CPU interface is the cache subsystem,
which reduces the effective read access time to main memory.
By reducing the processor's need to access main memory, a
cache improves both single-processor and multiprocessor system
performance. This paper discusses the complexities involved in
choosing the optimal cache design and the simulation
techniques used to ensure informed design decisions.

One of the biggest problems in cache design is choosing the
correct set of workloads to characterize the cache
performance. Cache performance can vary tremendously with
different workloads. Therefore, we chose a set of workloads
that spanned a wide range of system activities. Toward the end
of this paper, we present actual cache performance results
that largely confirm the legitimacy of our approach.

We also examine one of the more complex 
aspects of multiprocessor designs, which is 
ensuring cache coherency across the entire sys- 
tem. Cache coherency refers to the maintenance 
of a sufficiently consistent memory state from the 
perspective of all processors and I/O devices 
within the system. 

The designers also went to great lengths to 
ensure maximum system reliability. As part of 
this effort, we generated a set of error-detection 
and response rules. These rules ensure that the 
operating system software can easily recover 
from almost all transient cache or bus failures. 
These rules are discussed. 

The following section is an overview of the 
VAX 6200 system architecture. It provides a basis 
for the subsequent discussions on the challenges 
of multiprocessor design, the VAX 6200 CPU 
responses to those challenges, the performance 
simulation environment, cache coherency and 
error handling, and finally, real performance 
results. 



Summary of VAX 6200 System Architecture 

The basic architecture of the VAX 6200 system, shown in
Figure 1, is no different from architectures used on recent
VAX systems. [1] The architecture most closely resembles that
of the VAX 8800 series. Processors and memories reside on a
single, high-speed interconnect called the XMI bus. All memory
is shared and equally accessible by all processors. Adapters
to the VAXBI bus also attach to the XMI. I/O devices, in turn,
are attached to the VAXBI buses. The XMI supports a total of
14 slots, which can be populated with modules to provide a
wide range of system configurations. These configurations can
range from small single-processor systems with 32 megabytes
(MB) of memory and a single I/O channel to a large
multiprocessor system with 256MB of memory and multiple I/O
channels. One of the primary system design goals was to
support up to eight processors with very good multiprocessor
performance. This goal guided the performance decisions
concerning the bus, memory, and processor designs.

The heart of the system, the XMI bus, is largely a hybrid of
the VAX 8800 NMI and VAXBI buses. The XMI is a synchronous bus
that runs with a 64-nanosecond (ns) cycle time. The data path
is 64 bits wide, and the maximum transfer rate is 100MB per
second. The protocol supports "pended reads" (as does the SBI
on the VAX-11/780 system and the NMI on the 8800). In a pended
read transaction, the CPU that wishes to read a location
requests use of the bus. When the request is granted, the CPU
transmits the address of the desired location. The appropriate
memory controller latches the address into an input queue and
begins a read access to the specified location. In the
meantime, bus ownership is relinquished by the CPU, and the
bus may be used by other devices. When the memory has
completed the look-up and has the data, it makes a request for
the bus. When granted the bus, the memory drives the requested
data on the bus, which is latched by the CPU that originally
requested the data. Pended protocols are contrasted with
nonpended protocols in Figure 2.

Figure 2  Pended versus Nonpended Protocols

Pended protocols are a big advantage when the bus cycle time
is significantly less than the memory access time. As a case
in point, a memory read on the XMI bus requires about 500 ns
(roughly 8 XMI cycles). Without a pended protocol, these 8
cycles on reads would result in wasteful bus stalls. Another
advantage of pended protocols is that they allow multiple
memory controllers to be used to advantage. In the case of the
VAX 6200, it was not practical to build a single memory
controller that could keep up with a saturated XMI bus. But it
was relatively easy to construct a memory controller that
could comfortably run at about one third the bus maximum. With
four interleaved memory controllers on the XMI, memory
controller bandwidth is greater than XMI bandwidth.
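
The closing claim can be checked with simple arithmetic. The sketch below uses the 100MB per second XMI maximum given in the text and, as an assumed value for "about one third the bus maximum," the 38.5MB per second single-controller hexword read rate listed in Table 3; the function name is hypothetical.

```python
import math


def controllers_to_match_bus(bus_mb_s: float, controller_mb_s: float) -> int:
    """Smallest number of interleaved controllers whose combined
    bandwidth is at least the bus bandwidth (ideal interleaving assumed)."""
    return math.ceil(bus_mb_s / controller_mb_s)


XMI_MAX_MB_S = 100.0     # from the text
CONTROLLER_MB_S = 38.5   # hexword reads, one memory (Table 3)

needed = controllers_to_match_bus(XMI_MAX_MB_S, CONTROLLER_MB_S)
aggregate_with_four = 4 * CONTROLLER_MB_S
```

Under these assumptions three controllers are enough to match the bus, and four provide 154MB per second, comfortably above the 100MB per second bus maximum.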



Challenges of Multiprocessor Design 

The major challenges faced by the multiprocessor system
designer result primarily from one simple system
characteristic. The intimate interface between processor and
memory that most single-processor systems enjoy must be
broken, and main memory must be shared among a large number of
devices. This sharing has several effects:

■ Main memory access time is significantly increased.

■ Bandwidth to main memory becomes a precious commodity that
  determines overall system performance.

■ Complexity results from increased bus traffic and parallel
  activities.

In the following sections, we expand on each of these effects
in relation to the VAX 6200 system.
Increased Main Memory Access Time
In a single-processor system, main memory is generally closely
coupled to the CPU. An example of this closely coupled
architecture is shown in Figure 3. Clearly, this architecture
provides the potential for low-latency and high-bandwidth
CPU-to-memory transactions.

In the VAX 6200 multiprocessor system, memory must be shared
by several devices and therefore cannot be closely coupled to
a single processor. The result is a significant increase in
main memory access time. Since the MicroVAX 3600 and VAX 6200
systems are both CVAX-based, a comparison of the main memory
access times for the two systems illustrates this point.
Table 1 shows the access time in processor cycles for the
two-level cache subsystem and the main memory.

Table 1 shows that the VAX 6200 takes about three times as
many processor cycles to access the first longword in memory
as does the MicroVAX 3600 system. The main reason for this
difference is that the MicroVAX 3600 memory controller
actually resides on the CPU module. Therefore, the system
architecture is optimized to provide minimum access time for
processor accesses to main memory. On the VAX 6200, system
memory is a shared resource equally accessible by all CPUs and
I/O devices. The price of this equality is increased latency
on all memory references. Note, however, that although latency
has increased, the VAX 6200 can support almost ten times more
memory bandwidth (the amount of data transferred per unit of
time).

As will be presented later, the VAX 6200 system uses memory
bandwidth to compensate for increased memory latency. Trading
bandwidth for latency is one of the fundamental tools of the
multiprocessor designer. Cache memory systems essentially
convert increased memory bandwidth (manifested as a larger
fill size) into lower average read latency (due to the
decreased miss rate in the cache resulting from the larger
fill size). This explanation is an oversimplification; details
of the trade-offs in cache design are presented below in the
section on the multiprocessor environment.

Table 2 reinterprets the data in Table 1 in terms of bandwidth
instead of latency. For example, the MicroVAX 3600 system
fetches 8 bytes of data from memory on a cache miss, which
requires 8 processor cycles of 90 ns each, or 720 ns. This
corresponds to 8 bytes of data every 720 ns, or 11.1MB per
second. In comparison, the VAX 6200 system fetches 32 bytes of
data on a cache miss, which corresponds to 16.7MB per second
(32 bytes of data every 1920 ns).


Table 1  Comparison of MicroVAX 3600 and VAX 6200 Memory
Latency (processor cycles, with access time in parentheses)

                            MicroVAX 3600    VAX 6200

Cache 1
(CVAX internal cache)       1 (90 ns)        1 (80 ns)

Cache 2
(Second-level cache)        2 (180 ns)       2 (160 ns)

Main Memory
  First longword            5 (450 ns)       14 (1120 ns)
  Second longword           8 (720 ns)       15 (1200 ns)
  Third longword            na               19 (1520 ns)
  Fourth longword           na               20 (1600 ns)
  Fifth longword            na               21 (1680 ns)
  Sixth longword            na               22 (1760 ns)
  Seventh longword          na               23 (1840 ns)
  Eighth longword           na               24 (1920 ns)



Table 2  Comparison of Processor Read Bandwidths on
MicroVAX 3600 and VAX 6200 Systems (in MB per second)

                            MicroVAX 3600    VAX 6200

Cache 1
(CVAX internal cache)       40.0             50.0

Cache 2
(Second-level cache)        20.0             25.0

Main Memory
  First longword            8.8              3.6
  Second longword           11.1             6.7
  Third longword            na               7.9
  Fourth longword           na               10.0
  Fifth longword            na               11.9
  Sixth longword            na               13.6
  Seventh longword          na               15.2
  Eighth longword           na               16.7

Limited Bandwidth to Main Memory
In a single-processor system such as the MicroVAX 3600, the
performance is generally limited by the CPU itself and not by
the main memory subsystem. The opposite is generally the case
on large multiprocessor systems, where a large number of
processors can create a bottleneck to the main memory
subsystem. A major goal of the multiprocessor designer is to
minimize the bandwidth required to support a given level of
CPU performance. In that way, the main memory bus can support
more processors; therefore, the system can attain higher total
throughput. For example, assume a processor requires an
average of 20 percent of the total bandwidth available to main
memory to run a given workload. Based just on bus bandwidth
considerations, the total system performance would not exceed
five times the single-processor performance if the system is
simultaneously running that workload on all processors. For a
number of reasons, systems are rarely designed such that the
bus must be saturated to meet their performance goals. This
same method of calculating performance can be used to estimate
performance at some lower level of bus utilization. A bus
utilization level of 75 percent is often used; in that case,
the system performance would be limited to 3.75 times the
single-processor system.
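
The bound used in this example is simply the target bus utilization divided by the share of bus bandwidth that one processor consumes. A minimal sketch, with a hypothetical function name and percentages taken from the text:

```python
def max_speedup(per_cpu_bus_pct: float, target_utilization_pct: float = 100.0) -> float:
    """Upper bound on multiprocessor speedup from bus bandwidth alone.

    per_cpu_bus_pct: percent of total bus bandwidth one processor needs.
    target_utilization_pct: percent of bus bandwidth the design may use.
    """
    if per_cpu_bus_pct <= 0:
        raise ValueError("per-CPU bandwidth share must be positive")
    return target_utilization_pct / per_cpu_bus_pct
```

For a processor that needs 20 percent of the bus, the bound is 5 at saturation and 3.75 at 75 percent utilization, matching the figures above.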

This example reveals one of the main compromises
multiprocessor system designers must make: increased
bandwidth, which would reduce the main memory access time seen
by a single processor, is traded off to reduce the total
bandwidth consumed by a single processor and thereby increase
total system throughput. Bandwidth is really not the
characteristic we are trying to minimize; the real goal is to
reduce the number of bus and memory cycles used to sustain a
given level of performance. As we will demonstrate, the
efficiency of the transfer generally increases as the transfer
size increases. Therefore the system can fetch twice as much
data from memory without using twice as many bus and memory
cycles. This characteristic is important when evaluating
various cache alternatives.

Again looking at the MicroVAX 3600 design, the CPU actually
starts accessing main memory once the first-level cache has
determined a miss occurred but before the look-up in the
second-level cache has completed. This overlap means the
memory controller will start a large number of accesses that
will never result in data being returned to the processor.
(The second-level cache will probably "hit" on more than 80
percent of these references.) This behavior is desirable for
many single-processor systems but would be inappropriate for a
multiprocessor design in which main memory bandwidth is
precious.

In the multiprocessor system, main memory 
bandwidth is shared by all processors and I/O 
devices. Table 3 compares the system bandwidth 
in the MicroVAX 3600 and VAX 6200 systems. 
Since the VAX 6200 uses a pended bus that sup- 
ports 1 to 8 memory controllers, we present two 
sets of bandwidth numbers for the VAX 6200 
memory subsystem: one for a single memory con- 
troller and another for a four-way interleaved, 
four-memory controller subsystem. 

The data makes a strong argument for large transfer sizes to
achieve high bandwidths on the VAX 6200. A large cache fill
size can be used to assure high read bandwidth, and a write
buffer can be used to provide longer length write
transactions. Note that longword writes are particularly
inefficient in the memory controller; nine cycles are required
for a longword write compared with only five cycles for a
quadword write. This inefficiency results from the
implementation of the error-correcting code (ECC) across a
quadword on the VAX 6200 memory. (VAX systems have
traditionally implemented ECC across a longword.) This
implementation improved the memory module capacity at the cost
of forcing all longword writes to be a read-modify-write
sequence in the memory.
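
The relationship between cycle counts and bandwidth in Table 3 can be reproduced with a one-line formula: bytes transferred divided by the number of cycles times the cycle time. The sketch below assumes the 64-ns XMI cycle from the text; the function name is hypothetical.

```python
def bandwidth_mb_s(nbytes: int, cycles: int, cycle_ns: float = 64.0) -> float:
    """Bandwidth in MB per second for nbytes moved in a number of cycles.

    Bytes per nanosecond equals GB per second, so multiply by 1000 for MB.
    """
    if cycles <= 0 or cycle_ns <= 0:
        raise ValueError("cycles and cycle time must be positive")
    return nbytes / (cycles * cycle_ns) * 1000.0


longword_write = bandwidth_mb_s(4, 9)    # read-modify-write, 9 cycles
quadword_write = bandwidth_mb_s(8, 5)    # full quadword, 5 cycles
hexword_read_bus = bandwidth_mb_s(32, 5) # hexword read on the XMI bus
```

A full quadword write moves twice the data in roughly half the cycles, which is why the write buffer's ability to merge longwords into quadwords matters so much.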

Table 3  MicroVAX 3600 and VAX 6200 Main Memory Bandwidth (in
MB per second with corresponding number of cycles in
parentheses)

                    MicroVAX 3600*   VAX 6200 XMI Bus   VAX 6200 Memory (64-ns cycles)
                    (90-ns cycles)   (64-ns cycles)     1 Memory       4 Memories

Reads
  Longword (4B)     8.8 (5)          31.2 (2)           10.4 (6)       41.6
  Quadword (8B)     11.1 (8)         62.2 (2)           20.8 (6)       83.2
  Octaword (16B)    na               83.3 (3)           31.2 (8)       124.8
  Hexword (32B)     na               100.0 (5)          38.5 (13)      154.0

Writes
  Longword (4B)
    Full            11.1 (4)         31.2 (2)           6.9 (9)        27.6
    Masked          6.3 (7)          31.2 (2)           6.9 (9)        27.6
  Quadword (8B)
    Full            na               62.2 (2)           25.0 (5)       100.0
    Masked          na               62.2 (2)           13.9 (9)       55.6
  Octaword (16B)
    Full            na               83.3 (3)           31.2 (8)       124.8
    Masked          na               83.3 (3)           16.7 (15)      66.8

* These numbers represent a CPU perspective. I/O devices on the
  Q-bus can use longer transfer lengths.

Increased System Bus Traffic
Another challenge to the multiprocessor designer is the
increased memory traffic in the system due to the increased
total system performance. For a given workload, it is fairly
accurate to assume that the traffic to main memory increases
linearly with the total system performance. Therefore, a
VAX 6240 (a four-processor 6200 system) would have roughly
four times the main memory traffic of the VAX 6210 (a
single-processor 6200 system). Since processors must monitor
main memory traffic to maintain cache coherency, this increase
in main memory traffic has to be considered when looking at
cache invalidate implementations. Again the single-processor
system has a much less severe problem. The single processor
has to monitor only the traffic from I/O devices, which
typically generate about one-tenth the traffic generated by a
single CPU. Extending this argument indicates that a VAX 6240
system must handle invalidate look-ups at a rate more than 30
times that of the MicroVAX 3600 system. (The VAX 6200 CPU has
to handle invalidates from three other CPUs and for about four
times as much I/O traffic.)
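
The "more than 30 times" estimate follows from the two ratios given in the text: I/O generates about one-tenth the traffic of a CPU, and each VAX 6240 CPU sees traffic from three other CPUs plus about four times the I/O. A minimal sketch, with hypothetical names and traffic expressed in units of one CPU's memory traffic:

```python
IO_TRAFFIC_PER_CPU_UNIT = 0.1  # I/O is about one-tenth of a CPU's traffic


def invalidate_load(other_cpus: int, io_multiple: float) -> float:
    """Traffic a CPU must check for invalidates, in units of one CPU's traffic.

    other_cpus: number of other processors on the bus.
    io_multiple: I/O traffic relative to the single-processor system.
    """
    return other_cpus + io_multiple * IO_TRAFFIC_PER_CPU_UNIT


microvax_3600 = invalidate_load(other_cpus=0, io_multiple=1)  # I/O only
vax_6240_cpu = invalidate_load(other_cpus=3, io_multiple=4)
ratio = vax_6240_cpu / microvax_3600
```

Under these assumptions the ratio comes out at about 34, consistent with the estimate in the text.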

The increased system bus traffic is a symptom 
of the large number of parallel activities that 
characterize a multiprocessor system. The abun- 
dance of queues in a multiprocessor system 
results in a more complex system. The section on 
cache coherency in this paper discusses several 
manifestations of this increased complexity. 

Table 4 summarizes the major differences 
between the single-processor and multiprocessor 
systems. 

This discussion has demonstrated that the per- 
formance of a multiprocessor system is very 
dependent on the designers making the right 
decisions about the CPU interface. In the next 
section, we discuss the basic architecture of the 
VAX 6200 CPU and specific aspects of the multi- 
processing environment. 



VAX 6200 CPU Design Alternatives 

This section presents an overview of the 
VAX 6200 CPU architecture, followed by a dis- 
cussion of the various implementation alterna- 
tives that we considered during the design pro- 
cess. We conclude with a list of specific design 
alternatives and a discussion of our performance 
simulation environment, which we used to exam- 
ine these alternatives. 



Table 4  Summary of Differences between Single-processor and
Multiprocessor Systems

                              Single-processor   Multiprocessor
Characteristic                System             System

Memory latency                Low                Medium
Performance bottleneck        CPU                Memory bandwidth
Invalidate rate               Low                High
Level of parallel activity    Low                High

The VAX 6200 CPU is a single-board VAX processor based on the
CVAX chip set designed and built by Digital. The 6200 CPU has
a CVAX cycle time of 80 ns (as compared to the MicroVAX 3600
90-ns CVAX cycle time); its nominal performance is 2.8 times
the VAX-11/780 system (slightly more than three times the
MicroVAX II).

A block diagram of the module is shown in Figure 4. Three
major buses are associated with the module. The CVAX processor
chip set communicates over the CVAX data and address bus
(CDAL). [2, 3] The SSC chip connects to the CDAL bus and
provides such functions as read-only memory (ROM) address
decoding, time-of-year clock support, and console terminal
interface. The CVAX chip contains the first-level cache. Also
connected to this bus is the second-level cache data store and
tag store logic. The path to the XMI bus is provided entirely
by the XMI interface gate array and the XMI corner. This gate
array provides all necessary synchronization between the CVAX
and XMI. Each CPU module has its own CVAX clock source, and
the XMI bus has a single clock source that provides
synchronous clock signals to all XMI nodes.

The XMI corner represents a standard set of 
interface components and a physical intercon- 
nect that ensure all XMI devices meet the timing 
and electrical characteristics required by the XMI 
specification. The XMI corner components inter- 
face to the rest of the logic on the module over 
the XMI chip interconnect (XCI). A duplicate tag
store also attaches to the XCI bus. 

As outlined in the previous section, several 
specific challenges must be addressed by the 
multiprocessor designer. At the CPU level, the
design responses are as follows:

■ Implement an effective cache to reduce the
effective access time and the total traffic to
main memory.

■ Implement a write buffer to decouple and
reduce write traffic.

■ Implement a duplicate tag store to reduce the
overhead and complexity of maintaining 
cache coherency. 

Cache Subsystem

We will first look at the issues associated with
designing an effective cache. The main character- 
istics of a cache are size, associativity, fill size, 
and block size. Size is simply the size in bytes of 
the data store section of the cache. As the size of 
the cache increases, the effectiveness of the cache 
also increases. Associativity refers to the number 
of sets in the cache. A cache with a single set can 
store data with a particular tag address in only a 
single location. (A single-set cache is often 
referred to as a direct-mapped cache.) A two-set 
cache has two locations capable of storing data 
with a particular tag address. As associativity is 
increased, the likelihood of cache "thrashing" 
decreases. (Thrashing occurs when two pieces of
data cannot simultaneously be in the cache due 
to an insufficient number of sets.) The likelihood 
of thrashing also decreases as the cache size 
increases. Therefore it follows in most cases 
that as the cache size increases, the benefits of 
increased associativity decrease. Fill size defines 
the amount of data that is fetched from main 
memory on a cache miss and loaded into the 
cache. Over the range of cache sizes of interest, 
the miss rate decreases as the cache fill size
increases. Block size refers to the size of the data 
block covered by a single tag address. In a direct- 
mapped cache, the block size is equal to the 
cache size divided by the number of tags. The fill 
size is equal to or less than the block size. 
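
As an illustration of how these characteristics relate, the following minimal sketch splits a byte address into tag, index, and offset for a direct-mapped cache. The function name and the sample addresses are hypothetical; the 256KB size and 4K tag count are the figures this paper settles on later, used here only as example inputs.

```python
def split_address(addr, cache_bytes, num_tags):
    """Split a byte address into (tag, index, offset) for a direct-mapped cache."""
    block_bytes = cache_bytes // num_tags      # block size = cache size / number of tags
    offset = addr % block_bytes                # byte position inside the block
    index = (addr // block_bytes) % num_tags   # the single tag entry that can cover it
    tag = addr // cache_bytes                  # remaining high-order address bits
    return tag, index, offset


CACHE_BYTES = 256 * 1024   # example data store size
NUM_TAGS = 4 * 1024        # example tag count

a = 0x00123440             # arbitrary example address
b = a + CACHE_BYTES        # a second address exactly one cache size away

print(split_address(a, CACHE_BYTES, NUM_TAGS))
print(split_address(b, CACHE_BYTES, NUM_TAGS))
```

The two addresses share an index but differ in tag, so in a single-set cache they compete for the same location. This is the thrashing case described above.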

A major issue facing the designer of any com- 
puting system is the amount of variation in per- 
formance that can be accepted over a wide range 
of workloads. Since we were concerned about our
ability to accurately model the effect of large 
caches, we wanted to err on the side of conser- 
vatism. This meant we would choose the largest 
cache size practical. The state-of-the-art tech- 
nology static random-access memories (SRAMs) 
available to the VAX 6200 team were expected to 
be 256 kilobit (Kb) parts with speeds down to 
35 ns. We determined that a pipelined cache
design with 35-ns SRAMs could support CVAX
cycle times down to 60 ns. This cycle time was
comfortably beyond our product goal, which was
to support a range of 70 ns to 100 ns, depending
on the success we had speed-binning CVAX parts.
We tentatively decided to use 64K-by-4 SRAMs
for the data store, largely because the 64K-by-4
configuration was expected to be the most read-
ily available. Since the CVAX has a 32-bit data
path, eight 64K-by-4 parts would naturally pro-
vide a 256KB direct-mapped cache (four times
the size of any previous VAX). This configuration
also provided the optimal one-output-load per
data line. We also examined configurations with
increased associativity to confirm our belief that
the benefit of set sizes greater than one is small
for caches in the range of 256KB.

Having selected a very large cache, we next 
considered block size and fill size. The XMI bus
supports only 8 (quadword), 16 (octaword), and
32 (hexword) byte transfers to memory. There-
fore, the fill size would have to be one of these
three sizes. The block size can be larger than the
fill size if the design supports what are called
subblock valid bits. Ideally the fill size and block
size would be the same. With a very large cache,
however, providing sufficient tag storage can be a
real problem. Again in an attempt to be conserva-
tive, we looked into state-of-the-art tag inte-
grated circuits. The best we found in the
required 25- to 30-ns speed range was a 2K-by-9
part. With two of these parts, we could imple-
ment a 2K tag store subsystem. A 256KB data
store with 2K tags would have a 128-byte block size.
Subblock valid bits would be needed to identify
which subblocks are actually valid. We decided it
would be practical to choose a larger tag store 
size in which four tag chips would be used to 
implement a 4K tag store subsystem. 
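
The sizing arithmetic in the two preceding paragraphs can be checked with a short sketch. The helper names are hypothetical; the part counts, tag counts, and fill sizes are those quoted in the text.

```python
KB = 1024


def data_store_bytes(num_parts, words_per_part, bits_per_word):
    """Total capacity, in bytes, of a data store built from identical SRAM parts."""
    return num_parts * words_per_part * bits_per_word // 8


def block_bytes(cache_bytes, num_tags):
    """Bytes covered by one tag in a direct-mapped cache."""
    return cache_bytes // num_tags


def subblocks_per_block(block, fill):
    """Number of subblock valid bits needed when the fill size is below the block size."""
    return block // fill


cache = data_store_bytes(num_parts=8, words_per_part=64 * KB, bits_per_word=4)
print("data store bytes:", cache)

for tags in (2 * KB, 4 * KB):
    block = block_bytes(cache, tags)
    for fill in (16, 32):
        print(tags, "tags:", block, "byte block,", fill, "byte fill,",
              subblocks_per_block(block, fill), "subblocks")
```

Doubling the tag store from 2K to 4K entries halves the block covered by each tag, and with it the number of subblock valid bits per tag.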

Choosing the ideal fill size was expected to 
involve an interesting compromise between sev-
eral characteristics. As the fill size is increased,
several things happen. 

■ The cache miss rate drops. Over reasonably 
large ranges, the miss rate can be reduced by 
requiring that more data be fetched on a cache 
miss. This is not true when the likelihood of 
using the new data is less than the likelihood 
that bringing in the additional new fill data 
will force the flushing of other cache data 
more likely to be used. 5 This will not occur
with cache and fill sizes in the range consid- 
ered for the VAX 6200. 

■ CPU stalls per miss increase. In the VAX 6200
CPU architecture, as the second-level cache is 
being filled, the CVAX cannot access it. On a 
second-level cache miss, the XMI interface 
does return the actual requested data item to 
the CVAX first and then completes the remain-
der of the cache fill. Therefore, the number of
cycles in which the CVAX is stalled waiting for
the second-level cache to become available 
again after a cache miss increases as the fill 
size increases. The CVAX internal cache 
remains accessible while the second-level 
cache is being filled. 

■ The MB per second to main memory required 
to support a given level of performance 
increases. If twice as much data is fetched on a 
cache miss, the miss rate does not drop by a 
factor of two. 5 Therefore, as fill size increases, 
the MB per second required to support a given 
level of performance increases. 

■ The "available MB per second" of the bus 
increases. The efficiency of buses that do not
have separate address lines (such as the XMI) 
increases as the transfer size increases. Basi- 
cally, the required address cycle can be amor- 
tized over more data cycles. 

■ The "available MB per second" of the memory 
controller increases. The memory controllers
in the VAX 6200 can deliver more MB per sec- 
ond if more data is fetched for a given fetch 
address. 

Based on our significant experience with VAX sys-
tems, we knew that either the 16-byte or the 32-byte
fetch would be the right choice. The results from
simulation would be used to select the final 
value. 
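
The amortization argument for a multiplexed bus can be made concrete with a simplified model. This sketch assumes one address cycle per transaction and 8 bytes moved per data cycle; both values are illustrative assumptions, not figures taken from the XMI specification, and the function name is hypothetical.

```python
def bus_efficiency(transfer_bytes, bytes_per_data_cycle=8, address_cycles=1):
    """Fraction of bus cycles that carry data for one transfer.

    Assumes a multiplexed bus on which every transfer pays a fixed number
    of address cycles before its data cycles (illustrative model only).
    """
    data_cycles = -(-transfer_bytes // bytes_per_data_cycle)  # ceiling division
    return data_cycles / (data_cycles + address_cycles)


for size in (8, 16, 32):   # quadword, octaword, hexword
    print(size, "bytes:", bus_efficiency(size))
```

Under these assumptions the data fraction rises with each doubling of the transfer size, which is the sense in which the "available MB per second" of the bus grows with fill size.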

Another major cache design issue was the 
configuration of the CVAX 1KB internal cache. 
This cache can be configured to run in a conven- 
tional instruction and data stream write-through 
mode. In this mode, the cache must be invali-
dated when writes occur to a stored block. Alter-
natively, the cache can be run in I-stream-only
mode, in which the cache does not have to be
invalidated on writes. Instead, the cache is auto-
matically flushed on VAX Return from Exception
or Interrupt (REI) instructions. The methods we 
used to ensure the success of this cache 
coherency mechanism are discussed in the sec- 
tion Maintaining Cache Coherency and Handling 
Cache Error Conditions. 

Assuming all other things remain equal, 
there is a performance penalty for choosing the 
I-stream-only mode. If we select I-stream-only 
mode, the following occurs: 

■ All D-stream references will require a mini- 
mum of two cycles instead of one. Generally, 
for VAX CPUs, an average of 0.8 D-stream
references are made per instruction, and an
average instruction on the CVAX requires
between 9 and 10 cycles. This would seem to
indicate that the performance penalty would
be about 8 percent (0.8 references divided by
9.5 cycles), assuming the D-stream miss rate
in the internal cache is 0 percent. With an 
expected more-typical 40 percent miss rate, 
the penalty would be about 5 percent. 

■ CVAX stalls will increase for references that 
occur while the second-level cache fill for a
previous reference is still not complete. This
increase results because the CVAX will need to
access the second-level cache on all D-stream
references.

■ Assuming a low frequency of REI instructions, 
the I-stream miss rate should improve since 
there will be no contention for cache blocks 
between the I and D streams. (REIs will cause 
the I-stream-only cache to flush.)

■ The module space needs will be less because 
there will be no need for an extra duplicate tag
to track the CVAX internal cache. Since the 
CVAX internal cache has two sets, it cannot be 
practically "followed" by a simple second- 
level, direct-mapped cache. 

Looked at another way, we could afford to 
devote more logic to making the second-level 
cache more effective if we did not support 
CVAX D-stream caching. 

■ The complexity lessens with one less cache to 
keep coherent with hardware. We would also have
more flexibility in implementing error-recov-
ery mechanisms and would not have to imple-
ment a complex mechanism to suppress the
generation of XMI write transactions when the 
invalidate queue was at risk of overflowing. 

We planned to use the simulation environment to 
quantify the performance penalty that results 
from running the CVAX cache in I-stream-only 
mode. 
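
The first-order penalty estimate given in the first item above can be reproduced with a few lines. This is a sketch of that back-of-the-envelope reasoning only, not of the simulation model; the function name is hypothetical, and it assumes that only D-stream references that would have hit in the internal cache lose one cycle.

```python
def istream_only_penalty(dstream_refs_per_instr, cycles_per_instr, internal_miss_rate):
    """Estimated fractional slowdown from running the internal cache I-stream-only.

    Assumption: each D-stream reference that would have hit in the internal
    cache costs one extra cycle; references that would have missed anyway
    cost nothing extra.
    """
    extra_cycles = dstream_refs_per_instr * (1.0 - internal_miss_rate)
    return extra_cycles / cycles_per_instr


for miss_rate in (0.0, 0.4):
    penalty = istream_only_penalty(0.8, 9.5, miss_rate)
    print(f"internal D-stream miss rate {miss_rate:.0%}: penalty {penalty:.1%}")
```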

Write-Buffer Subsystem
Conventional write-through caches greatly reduce 
read traffic to main memory but do not reduce 
the write traffic. Therefore, although the mix of
read and write references from the CPU itself is 
weighted heavily toward reads, the traffic down- 
stream of a write-through cache is primarily 
writes. Other cache architectures offer the poten-
tial to reduce write traffic. A write-back cache
might be considered the obvious approach. By
caching writes as well as reads, a write-back 
cache offers the potential for the highest perfor- 
mance multiprocessor system. Nevertheless, the 
complexity is significantly higher than a write- 
through design. Industry experience is that very 
few write-back caches work on first-pass, and 
their bugs are very difficult to fix. Another risk
with write-back caches is in the area of error
recovery. It is much more difficult to recover
from transient cache errors with a write-back pro- 
tocol. To avoid the increased complexity and 
resulting schedule risk, we decided to pursue a 
hybrid approach. We would implement a write- 
through cache with a write buffer design very 
similar to that of the VAX 8800 cache. 7 

A write buffer resides between a write-through 
cache and the system bus. A write buffer is actu- 
ally a simple, very effective form of a write-back 
cache. A write buffer takes advantage of the local- 
ity of write transactions to reduce the number of 
write references to main memory by combining 
several small write references into a single larger 
transaction to main memory. This behavior has 
three main advantages. First, almost all buses 
(including the XMI) increase in efficiency as the 
transfer size is increased. This efficiency results 
because every transfer generally requires the 
transmission of an address cycle before the data. 
This address cycle is basically fixed overhead that 
can be more effectively amortized as the transfer 
size is increased. The transfer sizes and relative 
efficiencies of the XMI bus are shown in Table 3.

Second, as previously mentioned, the VAX 6200 
memory does not efficiently process longword 
write transactions. The write buffer converts 
significant numbers of longword write transac- 
tions into full quadword and octaword transac- 
tions that are processed with many times greater 
efficiency. 

Finally, the buffer helps to reduce the fre- 
quency of processor "write stalls," that is, pro- 
cessor cycle slips due to writes to main memory 
that back up. The buffer largely decouples the 
processor from the main memory write timing; 
the processor perceives that most writes are com- 
pleted in minimum time. 

The VAX 6200 write buffer accumulates write 
data until a memory write address falls outside 
the address range of the current block. When this 
occurs, an alternate octaword buffer begins 
filling. The first buffer is emptied either with an 
octaword XMI transaction (if the buffer contains 
more than an aligned quadword) or with a quad-
word XMI transaction (if the buffer contains no
more than an aligned quadword). CVAX CPU 
reads (unless interlocked or made to I/O space) 
are allowed to bypass the write buffers after first 
being checked for an address match with the 
write buffer. 

Either a read address comparison match or an 
interlocked or I/O space transaction forces the 
write buffer to be purged. There are several other 
conditions under which the write-buffer must be 
flushed. These conditions are discussed in the 
section Maintaining Cache Coherency and Han- 
dling Error Conditions. 
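
The accumulate-and-flush behavior described above can be sketched as a small model. This is a simplified illustration, not the hardware design: the class and method names are hypothetical, the two alternating buffers are collapsed into one buffer that is emptied immediately, writes are assumed not to cross an octaword boundary, and a read is treated as a match when it falls in the buffered octaword (the exact comparison granularity is an assumption).

```python
OCTAWORD = 16
QUADWORD = 8


class WriteBuffer:
    """Simplified single-buffer model of an octaword write buffer."""

    def __init__(self):
        self.block = None        # base address of the octaword being accumulated
        self.dirty = set()       # byte offsets written within that octaword
        self.transactions = []   # (kind, address) pairs sent toward memory

    def write(self, addr, size=4):
        """Accept a write (longword by default); flush first if it leaves the block."""
        base = addr - addr % OCTAWORD
        if self.block is not None and base != self.block:
            self.flush()
        self.block = base
        self.dirty.update(range(addr - base, addr - base + size))

    def flush(self):
        """Empty the buffer as one quadword or one octaword transaction."""
        if self.block is None:
            return
        quads = {offset // QUADWORD for offset in self.dirty}
        if len(quads) == 1:      # no more than one aligned quadword is dirty
            self.transactions.append(("quadword", self.block + quads.pop() * QUADWORD))
        else:                    # more than an aligned quadword is dirty
            self.transactions.append(("octaword", self.block))
        self.block = None
        self.dirty = set()

    def read(self, addr):
        """Reads bypass the buffer unless they match the buffered block."""
        if self.block is not None and addr - addr % OCTAWORD == self.block:
            self.flush()
            return "flushed"
        return "bypass"


wb = WriteBuffer()
for addr in (0x100, 0x104, 0x108, 0x10C, 0x200, 0x204):
    wb.write(addr)
wb.flush()
print(wb.transactions)
```

In this example, six longword writes leave the buffer as one octaword and one quadword transaction, which is the kind of reduction in write traffic the design is meant to achieve.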

We believed the write buffer could provide 
about half the bandwidth benefit of the write- 
back cache but with little more complexity than 
a simple write-through design. As an added 
benefit, the buffer architecture was already 
implemented and running with very good perfor- 
mance results in a VAX multiprocessor (VAX 8800 
family). We planned to use performance simula- 
tions to confirm that the write buffer was ade- 
quate to meet our performance goals. 

Duplicate Tag Store 

As noted earlier, a multiprocessor environment 
puts significant strain on the cache coherency 
logic. The rates at which write addresses on the 
system bus must be checked against the addresses 
stored in the cache require that a different archi- 
tecture be used for servicing invalidates. 

The 2K-by-9 tag chips used to implement the
main tag store are also used to implement a 
duplicate tag store. The duplicate tag store runs 
synchronously with the XMI bus and permits 
filtering of invalidates, so the CPU would stall 
only on an XMI write hit. It is not uncommon to 
have ratios from 100 to 1 up to 10,000 to 1 between
duplicate tag misses and duplicate tag hits.

The operation of the duplicate tag store is dis- 
cussed in the section Maintaining Cache Coher- 
ency and Handling Error Conditions. 

We have now defined the basic architectural 
issues that needed to be resolved and have indi- 
cated the alternatives we would like to pursue. In 
the next section we present the results of our per- 
formance simulations. 

The following list summarizes what we exam- 
ined in our simulation environment: 

■ Determine the loss in performance that would 
result from running the CVAX internal cache 
in I-stream-only mode instead of combined 
I- and D-stream mode.

■ Investigate octaword (16-byte) versus hex-
word (32-byte) fill sizes for both I-stream and
D-stream. Further, examine the relative miss 
rates, MBs per unit of performance, bus cycles 
per unit of performance, memory cycles per 
unit of performance, and absolute perfor- 
mance. Look at a large multiprocessor system's 
sensitivity to main memory access time. 

■ Determine the effectiveness of a write-through 
with write-buffer cache architecture. In other
words, can the writes be reduced sufficiently
to avoid write-back in the chosen architecture?

■ Examine the benefits of a two-way, set-associa- 
tive cache over a simpler direct-mapped 
design. 

Performance Simulation 

The basis of the simulation environment was a 
high-level performance model of the CVAX chip. 
Written in PASCAL, this model was interfaced to a 
configurable second-level cache, write buffer, 
and memory subsystem. The model accepted
instruction traces for input. At the time the per-
formance modeling was done, seven standard
benchmarks were available: DIRECTORY, EDT,
FORTRAN, LINKER, MAIL, RUNOFF, and SORT.
All instruction traces were captured from a
VAX-11/780 system. Since each trace was for a
single process, one of the major issues was deter-
mining how to correctly model the effect of
timesharing on cache performance.

The very nature of timesharing has a negative 
effect on cache performance as compared with
single-process runs. Ideally, the cache would be
dedicated entirely to holding instructions and
data associated only with a single process. In
timeshare systems, processes are not initiated and
then run nonstop to completion; instead the CPU
is constantly switching from process to process.
This switching requires the cache resources to be
distributed across a number of processes and
therefore reduces the effectiveness of the cache.
A VAX-11/780 study 6 indicates that the average
number of instructions between context switches
on a VAX system is about 5,000 instructions. A
traditional and very conservative approach to
simulating the effect of context switches is to
flush the entire cache every 5,000 instructions.
Flushing the cache every 5,000 instructions
was not a big penalty for small caches that
could quickly refill themselves after a flush;
however, the advantage of larger caches (that
we know actually exists) could not be demon-
strated when the model ran with a flush every
5,000 instructions.

To more accurately model the benefits of large 
caches, internal studies of complex timeshare 
loads were undertaken. Multiuser program traces 
were run against a cache model, subjecting the 
cache model to the context-switch behavior of a 
real system. The cache performance results of 
that run were compared with single jobs run 
against a cache model that was flushed after vari- 
ous numbers of instructions had been executed. 

The results indicated that similar cache perfor- 
mance results could be obtained in simulation by 
using a single job trace and complete cache 
flushing every 35,000 instructions. The number 
35,000 applies only to a 256KB cache; smaller 
caches would have a smaller context-switch 
interval. We decided to simulate the VAX 6200 
with the 256KB cache flushed every 35,000 
instructions; the 1 KB CVAX internal cache would 
be flushed at the more traditional 5,000 instruc- 
tion rate. All simulations would represent a sin- 
gle-processor system; main memory access times 
would be minimum. The performance results 
would generally be presented as a set of relative 
numbers comparing the alternatives. 
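
The effect of the simulated context-switch interval can be illustrated with a toy direct-mapped cache model. This is a sketch under stated assumptions, not the PASCAL model described above: the trace is synthetic (repeated sweeps over a 64KB working set) rather than one of the seven benchmarks, the interval is counted in memory references rather than instructions, and the function name is hypothetical.

```python
def miss_rate(trace, cache_bytes, block_bytes, flush_interval):
    """Miss rate of a direct-mapped cache flushed every `flush_interval` references.

    A flush_interval of 0 means the cache is never flushed.
    """
    num_blocks = cache_bytes // block_bytes
    tags = [None] * num_blocks
    misses = 0
    for n, addr in enumerate(trace):
        if flush_interval and n > 0 and n % flush_interval == 0:
            tags = [None] * num_blocks      # simulated context switch
        block = addr // block_bytes
        index = block % num_blocks
        if tags[index] != block:            # miss: fill the block
            misses += 1
            tags[index] = block
    return misses / len(trace)


# Synthetic trace (an assumption for illustration only):
# one reference per 64-byte block, sweeping a 64KB working set repeatedly.
TRACE = [(i * 64) % (64 * 1024) for i in range(100_000)]

for interval in (0, 5_000, 35_000):
    print(interval, miss_rate(TRACE, 256 * 1024, 64, interval))
```

With a flush every 5,000 references the large cache spends much of each interval refilling, so its miss rate is several times higher than with a 35,000-reference interval. This matches the observation that frequent flushing hides the benefit of a large cache.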

Table 5 summarizes all the cache characteris- 
tics we would simulate. 

CVAX Internal Cache Configuration 
The first aspect examined was the CVAX cache 
configuration. As shown in Table 6, the I- and 
D-stream design offered an average increase in 
performance of 5 percent over the I-stream-only
cache. We concluded 5 percent average perfor-
mance could be sacrificed in return for the
reduced complexity of the I-stream-only design.



Table 5 Cache Characteristics Simulated

                           CVAX Cache            Second-level Cache

Associativity              2-way                 Direct-mapped/2-way
Configuration              I & D/I only          I & D
Size                       1KB                   256KB
Block size                 8B                    64B
Fill size                  8B                    16B/32B
Tags                       1K                    4K
Simulated context          5,000 instructions    35,000 instructions
switch rate

Table 6 CVAX I-stream and I- and D-stream
Relative Performance

             I-stream     I- & D-stream

Average      1.00         1.05
Minimum      1.00         1.03
Maximum      1.00         1.07



Octaword versus Hexword Fill Size 

Choosing an octaword or a hexword fill size was 
the next and probably the most complex major 
issue. The results are shown in Table 7. In all 
cases, relative numbers are used with the charac- 
teristics of the octaword machine as the refer- 
ence. 



Table 7 Octaword versus Hexword Fill Size Results

                       Relative
Fill Size              Performance

Octaword   All         1.00
Hexword    Average     1.01
           Minimum     1.01
           Maximum     1.02

                       Relative Miss Rates                 Relative MB/sec
Fill Size              I-stream   D-stream   All Reads     I-stream   D-stream   All Reads

Octaword   All         1.00       1.00       1.00          1.00       1.00       1.00
Hexword    Average      .56        .84        .71          1.12       1.68       1.42
           Minimum      .54        .81        .68          1.08       1.61       1.36
           Maximum      .57        .87        .76          1.14       1.74       1.52

                       Percent XMI                         Percent Memory
Fill Size              I-stream   D-stream   All Reads     I-stream   D-stream   All Reads

Octaword   All         1.00       1.00       1.00          1.00       1.00       1.00
Hexword    Average      .93       1.40       1.18           .91       1.36       1.15
           Minimum      .90       1.34       1.16           .88       1.31       1.13
           Maximum      .96       1.45       1.27           .95       1.41       1.27

                       Relative Percent     Relative Percent
Fill Size              XMI Utilized         Memory Utilized

Octaword   All         1.00                 1.00
Hexword    Average     1.08                 1.04
           Minimum     1.06                 1.03
           Maximum     1.09                 1.07

A summary of the results in Table 7 follows:

■ The fill size has a negligible effect on perfor- 
mance (less than 1 percent difference). The 
hexword alternative delivered an average of 
1 percent better performance. 

It is important to keep in mind that the simula- 
tion was performed assuming the minimum 
delay from main memory. In a multiprocessor 
system, the alternative with the lower miss 
rate increases in performance relative to the 
other alternatives as the main memory access 
time increases. 

■ Hexword fetches dropped the overall miss rate 
by almost 30 percent. (As expected, the 
I-stream miss rate improvement was much 
higher — almost 50 percent.) 

■ The megabytes per second required to main- 
tain a given performance level increased by 
about 40 percent overall for the hexword 
fetch. 

■ As mentioned earlier, we were less concerned
about megabytes per second than about the
percentage of bus and memory controller
cycles consumed. In this light, the hexword
alternative required about 18 percent
more bus cycles and 16 percent more memory
cycles to support read traffic to main memory.
Eighteen percent and 16 percent may seem
like a big increase, but it is important to look
at overall bus bandwidth. On a write-through
interconnect, the writes generally dominate
the traffic.

■ The overall bus traffic (taking into account
writes) increased by only about 9 percent.
Overall memory controller cycles increased by
even less — only about 4 percent. The low
increase resulted because the ratio of write
cycles to read cycles is higher in the memory
controller than on the XMI bus.

Based on this data, we chose the hexword
fill alternative. We felt the potential for
significantly more consistent performance in
large multiprocessor configurations (due to
decreased cache miss rate) was worth the esti-
mated 9 percent increase in bus utilization.

Table 8 Write Buffer Effectiveness

                            Ratio With Write Buffer/
                            Without Write Buffer*

           Write Buffer     XMI             Memory
           Miss Rate        Utilization     Utilization

Average    47.1%            .55             .49
Minimum    40.4%            .50             .42
Maximum    54.9%            .64             .58

* The utilization numbers are expressed as ratios between the
utilization with a write buffer and the utilization without the write
buffer.

Table 9 XMI Bus Utilization per CPU

           I-stream     D-stream
           Reads        Reads        Writes       Total*

Average     .89%        1.39%        4.41%        6.27%
Minimum     .24%        1.26%        3.57%        5.27%
Maximum    1.65%        2.10%        5.97%        7.25%

* The numbers in this column are averages of the total XMI bus
utilization across the seven workloads. These numbers are not
sums of the individual utilization percentages in each column.

Write Buffer Effectiveness and Overall 
Bus Utilization 

We were pleased to find that the write buffer was
about as effective as we had predicted. The data 
in Table 8 compares the XMI write traffic gener- 
ated with and without a write buffer. The data 
is quite consistent. On average, the write buffer 
reduced the number of write cycles on the bus by 
slightly less than half (45 percent) and reduced 
the memory controller cycles by slightly more 
than half (51 percent). 

Table 9 shows the bus utilization by the 
VAX 6200 CPU running the test benchmarks. 
Using the average bus utilization number of 
6.27 percent still yields only 50 percent for a full 
eight-processor system; the 7.25 percent maxi- 
mum value yields 58 percent utilization. These 
figures are well within our 75 percent utilization 
design goal, and we decided to implement the 
write-buffer instead of the write-back design. 

Another more conservative way to look at the 
data is to assume that we may not have the worst- 
case environment covered in any single bench- 
mark. Therefore we should look at the "sum of 
maximums" to determine whether the design 
goal is met. Using the sum of maximums 
approach, we require 9.72 percent of the XMI
per processor, or about 78 percent for eight pro-
cessors. This figure is sufficiently close to our
design goal of 75 percent maximum utilization
to be acceptable.
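
The utilization arithmetic in the last two paragraphs can be reproduced from the Table 9 figures. This sketch assumes, as the text does, that bus utilization scales linearly with the number of processors; the variable and function names are hypothetical.

```python
# Per-CPU XMI utilization in percent, taken from Table 9.
PER_CPU_TOTAL = {"average": 6.27, "maximum": 7.25}
PER_CPU_MAXIMA = {"istream_reads": 1.65, "dstream_reads": 2.10, "writes": 5.97}

CPUS = 8
GOAL_PERCENT = 75.0


def system_utilization(per_cpu_percent, cpus=CPUS):
    """Bus utilization for `cpus` processors, assuming linear scaling."""
    return per_cpu_percent * cpus


for label, value in PER_CPU_TOTAL.items():
    total = system_utilization(value)
    print(f"{label}: {total:.1f}% (goal {GOAL_PERCENT}%)")

sum_of_maximums = sum(PER_CPU_MAXIMA.values())
print(f"sum of maximums: {sum_of_maximums:.2f}% per CPU, "
      f"{system_utilization(sum_of_maximums):.1f}% for {CPUS} CPUs")
```

The sum-of-maximums case is the only one that lands above the 75 percent goal, and only by a few percentage points.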

Effect of Associativity

We next explored the benefits of associativities 
greater than one. Implementation of a cache 
other than a direct-mapped cache was probably 
not practical. However, we wanted to examine 
the performance results. 

The results given in Table 10 indicate that a 
two-way, set-associative cache could reduce the 
overall miss rate by 13 percent, whereas the per-
formance gain was negligible (1 percent). This
improvement in miss rate is fairly significant; 
but we determined it was not practical from a 
module real estate and electrical timing perspec- 
tive to implement other than a direct-mapped 
scheme. To implement a fast two-way cache, two 
separate RAM arrays must be supported. This 
implementation requires roughly twice the mod-
ule area of a direct-mapped approach. A two-
way cache can be implemented with a single
RAM array (the RAM look-up cannot start until
the proper set has been identified), but this
would force the access time to increase by a 
cycle. Increasing the access time to the second- 
level cache would be particularly undesirable to 
the VAX 6200 designers since we had already 
decided to configure the CVAX cache in I-stream- 
only mode. (With an additional cycle, all 
D-stream references would then require a mini- 
mum of three cycles.) Board area constraints and 
increased cache access time are the two most 
common reasons for rejecting the miss reductions 
of the multiway cache in favor of the simplicity 
and the practical, fast access time of the direct- 
mapped cache. 
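
The miss reduction that a two-way cache can offer is easiest to see in the extreme thrashing case. The following sketch compares a direct-mapped cache and a two-way cache of equal capacity on a synthetic trace that alternates between two addresses one cache size apart. The LRU replacement policy, the trace, and the function name are assumptions made for illustration; they are not taken from the simulation model.

```python
def count_misses(trace, num_sets, ways, block_bytes=64):
    """Count misses in a set-associative cache with LRU replacement (assumed policy)."""
    sets = [[] for _ in range(num_sets)]   # each list holds block numbers, LRU first
    misses = 0
    for addr in trace:
        block = addr // block_bytes
        lines = sets[block % num_sets]
        if block in lines:
            lines.remove(block)            # hit: will be re-appended as most recent
        else:
            misses += 1
            if len(lines) == ways:
                lines.pop(0)               # evict the least recently used block
        lines.append(block)
    return misses


CACHE_BYTES = 256 * 1024
A = 0x4000
THRASH_TRACE = [A, A + CACHE_BYTES] * 100  # two addresses with the same index

direct_mapped = count_misses(THRASH_TRACE, num_sets=4096, ways=1)
two_way = count_misses(THRASH_TRACE, num_sets=2048, ways=2)
print("direct-mapped misses:", direct_mapped)
print("two-way misses:", two_way)
```

Real workloads rarely behave this badly, which is consistent with the modest average miss-rate improvement reported for the two-way alternative.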

Table 10 Direct-mapped versus Two-way
Cache Performance

             All Reads
             Relative Miss Rates        Relative Performance

             Direct-                    Direct-
             mapped       Two-way       mapped       Two-way

Average      1.00         .87           1.00         1.01
Minimum      1.00         .74           1.00         1.00
Maximum      1.00         .95           1.00         1.02

Maintaining Cache Coherency and
Handling Cache Error Conditions

As mentioned in the introduction, a major chal-
lenge to a multiprocessor designer is to imple-
ment a reliable scheme for cache coherency.
Coherency is a term somewhat difficult to define.
In this section, we give some insight into the
meaning of coherency and the methods employed
by the VAX 6200 project engineers to ensure
coherency. We also describe our techniques for
supporting recovery from all single-bit transient
cache errors.

For this discussion, we divide the cache sub- 
system of the VAX 6200 into three sections. Fig- 
ure 5 shows the three major subsystems in the 
VAX 6200 cache: 

■ The CVAX internal I-stream-only cache 

■ The 256KB I- and D-stream cache

■ The 16-byte write buffer (a form of write-back 
cache) 

CVAX I-stream-only Cache
The first cache, contained within the CVAX chip 
itself, is configured for I-stream-only operation. 
In that mode, the CVAX flushes the entire con- 
tents of the cache whenever a VAX REI instruc- 
tion is executed. Motivated originally by the 
potential problems with instruction prefetch 
buffers, the VAX architecture defines rules for 
software to assure that writes to I-stream data 
produce predictable results. In all cases, if the 
rules are not followed, stale data may be read 
from the cache and cause unpredictable results.

Figure 5 VAX 6200 Cache Subsystems






Second-level I- and D-stream
256KB Cache

The second-level cache is architecturally similar 
to caches used on most VAX systems. With a 
write-through design, the cache stores both 
I- and D-stream data. Coherency is maintained by
monitoring all writes from other devices to main
memory and invalidating cached locations that
correspond to any of the monitored writes. The
processor does not generate invalidates for its 
own writes to main memory since the cache is 
write-through; a write by the processor itself that 
hits in the cache immediately updates the appro- 
priate location. 

The VAX 6200 second-level cache coherency 
logic is shown in Figure 6. A duplicate tag store is 
located on the multiplexed XCI bus. This store 
contains a duplicate copy of the 4,096 cache tag
entries, which are in the second-level cache
located on the CDAL. The duplicate tag store
tracks the primary tag store on allocates by moni-
toring XMI read transactions. Whenever an XMI
memory space read is initiated, the CPU allocates 
the cache block that corresponds to the read 
address. 

The duplicate tag store also monitors all XMI 
write transactions and performs a duplicate tag 
store look-up. If a hit occurs and the write was 
not from this CPU, then the duplicate tag location 
is invalidated. The address is then loaded into an 
eight-entry invalidate queue implemented in the 
XMI interface gate array. Cache invalidates are 
not performed in response to an individual CPU's 
own writes since the write-through second-level 
cache always contains the most recent data. 

When an entry has been loaded into the invali-
date queue, the CDAL interface logic arbitrates
for the CDAL and invalidates the full 64-byte
block in which the write address was located.
The use of a duplicate tag store reduces CDAL
traffic to only necessary invalidate transactions. 
After performing an invalidate, the XMI interface 
gate array checks for any additional invalidates 
that may have accumulated while the previous 
invalidate was being serviced. If another invali- 
date request exists, then it is serviced prior to 
release of the CDAL. This procedure ensures that 
invalidates are serviced as quickly as possible. 
The CVAX bus interface ensures that the invali- 
date logic is given an opportunity to use the 
CDAL between every CVAX bus operation. 

Figure 6 Second-level Cache Coherency Logic

Though occurring very infrequently, the XMI
bus could issue writes quickly enough to over-
flow the CPU's invalidate queue. Instead of
adding significant complexity to the invalidate 
controller to suppress the generation of XMI 
write commands when the invalidate queue is at 
risk of overflowing, the overflow condition is han- 
dled as an exception condition. (This subject is 
discussed in the section Handling Second-level 
Cache Error Conditions.) For this alternative to 
be practical, we had to ensure that invalidate 
queue overflows would be very rare; we felt this 
was ensured by the depth of the invalidate queue 
(eight entries) and the optimized design of the 
invalidate controller. 
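The duplicate tag store and invalidate queue behavior described above can be sketched as a small software model. This is only an illustration under stated assumptions: the class and method names are hypothetical, the direct-mapped index/tag split is inferred from the 4,096 entries and 64-byte blocks mentioned in the text, and "reads" are assumed to be this CPU's own memory-space reads.

```python
from collections import deque

BLOCK_BYTES = 64    # invalidate granularity given in the text
NUM_TAGS = 4096     # tag entries in the second-level cache
QUEUE_DEPTH = 8     # invalidate queue entries


class InvalidateQueueOverflow(Exception):
    """Hypothetical exception: an invalidate arrived while the queue was full."""


class DuplicateTagStore:
    """Illustrative model only; not the actual gate array logic."""

    def __init__(self):
        self.tags = [None] * NUM_TAGS   # None marks an invalid entry
        self.queue = deque()            # pending invalidate block addresses

    @staticmethod
    def _split(addr):
        block = addr // BLOCK_BYTES
        return block % NUM_TAGS, block // NUM_TAGS   # (index, tag)

    def observe_read(self, addr):
        # This CPU's XMI memory-space read allocates the matching block.
        index, tag = self._split(addr)
        self.tags[index] = tag

    def observe_write(self, addr, from_this_cpu):
        # Every XMI write is looked up; only hits from other nodes invalidate.
        index, tag = self._split(addr)
        if from_this_cpu or self.tags[index] != tag:
            return False
        if len(self.queue) >= QUEUE_DEPTH:
            raise InvalidateQueueOverflow(hex(addr))
        self.tags[index] = None
        self.queue.append(addr - addr % BLOCK_BYTES)
        return True

    def drain(self):
        # Service all accumulated invalidates before releasing the CDAL.
        serviced = list(self.queue)
        self.queue.clear()
        return serviced
```

Because the duplicate entry is cleared on the first hit, later writes to the same block miss in the duplicate tag store and generate no further CDAL traffic, which is the filtering effect the text describes.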

The Write Buffer 

A write buffer design offers the designer oppor- 
tunities to break cache coherency rules. The 
VAX 6200 CPU follows several rules to maintain 
coherency. The VAX 6200 hardware automati-
cally flushes the write buffer under the following
conditions:

■ In response to a write that misses the currently 
active write buffer. The current write buffer is
flushed while the new write is accepted by
the alternate buffer, thus write ordering is 
maintained. 

■ Before an XMI I/O space read or write refer-
ence is performed. I/O references could result
in the initiation of an I/O operation that may 
require the data from the write buffer. 

■ Before an interlock read or unlock write refer- 
ence is performed. Interlock sequences are 
the primary means for synchronization 
between processors and must always force all 
outstanding writes to main memory. 

■ Before an interprocessor interrupt is per- 
formed. As with interlocks, interprocessor 
interrupts are used for synchronization be- 
tween processors and must always force all 
outstanding writes to main memory. 

■ Before issuing an XMI read to a location that 
includes the data contained in the write 
buffer. The write buffer contents are flushed to 
main memory and then the XMI read is issued. 
Reads that miss the write buffer do not force a 
write buffer flush ("write buffer bypass").

■ Following the assertion of the CVAX clear- 
write-buffer pin, the CPU flushes the write 
buffer to main memory. This form of write 
buffer flushing is primarily used to associate 
failed writes with a given process. If no associ- 
ation could be made, then the operating sys- 
tem would always have to crash the entire sys- 
tem on every failed write transaction. 
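The flush rules listed above can be condensed into a single decision function. The sketch below is a hedged illustration: the `Op` names and the `must_flush` helper are hypothetical, and the model assumes a single active buffer identified by one block address, which the text does not specify.

```python
from enum import Enum, auto


class Op(Enum):
    MEMORY_WRITE = auto()
    MEMORY_READ = auto()
    IO_READ = auto()
    IO_WRITE = auto()
    INTERLOCK_READ = auto()
    UNLOCK_WRITE = auto()
    INTERPROCESSOR_INTERRUPT = auto()
    CLEAR_WRITE_BUFFER = auto()   # CVAX clear-write-buffer pin asserted


# Operations that force a flush regardless of the address involved.
ALWAYS_FLUSH = frozenset({
    Op.IO_READ, Op.IO_WRITE,
    Op.INTERLOCK_READ, Op.UNLOCK_WRITE,
    Op.INTERPROCESSOR_INTERRUPT,
    Op.CLEAR_WRITE_BUFFER,
})


def must_flush(op, block, buffered_block):
    """Return True if the active write buffer must be flushed first.

    block          -- block address of the new reference
    buffered_block -- block held by the active buffer, or None if empty
    """
    if buffered_block is None:
        return False                      # nothing outstanding to flush
    if op in ALWAYS_FLUSH:
        return True
    if op is Op.MEMORY_WRITE:
        return block != buffered_block    # write misses the active buffer
    if op is Op.MEMORY_READ:
        return block == buffered_block    # read hits buffered data
    return False
```

A read to an unrelated block returns False here, which corresponds to the "write buffer bypass" case in the list.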

Handling Second-level Cache Error 
Conditions 

One of the major goals of the VAX 6200 design 
was to provide improved system reliability. One 
method we used was hardware-enforced soft 
failover in response to many error conditions,
combined with efficient software recovery proce- 
dures. This method was used extensively when 
dealing with all types of second-level cache 
errors.

In general, the individual processors have the 
responsibility to recover from potential cache 
coherency failures. When errors occur that 
may leave the second-level cache incoherent, the 
VAX 6200 processor hardware automatically dis- 
ables the cache. Disabling the cache ensures that 
the system can continue to run "safely," albeit at 
reduced performance. The processor then posts a
"soft" error interrupt. The interrupt service rou-
tine responds by logging the error and then
flushing and reenabling the cache.

The following error conditions cause the XCP 
hardware to disable the second-level cache. The 
errors are of two forms. The first two are error 
conditions that potentially result in a missed 
cache update on a write-through; the last three 
deal with conditions under which an invalidate is 
potentially missed: 

■ Subblock valid bit parity errors — The 
VAX 6200 CPU supports a doubly-redundant set 
of subblock valid bits. On a cache look-up, 
if the two corresponding valid bits do not match, 
then the hardware reports a parity error and 
forces a cache miss. If this error occurs on a 
write-through that should have hit in the cache, 
then the cache state is no longer consistent. 

■ Cache tag parity errors — The tag chips used 
on the VAX 6200 CPU support parity on the 
full tag address. As with valid bit errors, a tag 
parity error can result in a missed write- 
through. 

■ XMI inconsistent parity error — If the CPU 
detects an XMI cycle that has bad parity and 
that cycle is acknowledged by another proces- 
sor, then the worst-case assumption is that the 
duplicate tag logic just missed a write transac- 
tion that should have resulted in an invalidate. 

■ Duplicate tag store parity error — As with the 
previous error, the processor has to assume the 
parity error resulted in a missed invalidate.

■ Invalidate queue overflow — Again, this con- 
dition is similar to the one above except that 
this condition does not require a transient 
error in the system. Instead, an invalidate 
queue overflow is the result of a very rare com- 
bination of XMI writes that result in a queue 
backup and the potential loss of invalidates. 
The system responds to this condition just as it 
would for all other cache errors. 
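The recovery sequence described in this section (hardware disables the cache and posts a soft error interrupt; software logs, flushes, and reenables) can be sketched as follows. All names are hypothetical and the model is deliberately minimal; it is not the VMS interrupt service routine.

```python
# Error kinds taken from the list above; the string labels are illustrative.
CACHE_DISABLING_ERRORS = frozenset({
    "subblock_valid_parity",
    "cache_tag_parity",
    "xmi_inconsistent_parity",
    "duplicate_tag_parity",
    "invalidate_queue_overflow",
})


class SecondLevelCacheModel:
    def __init__(self):
        self.enabled = True
        self.blocks = {}           # block address -> cached data
        self.pending_errors = []   # errors awaiting the soft error interrupt
        self.error_log = []

    def hardware_error(self, kind):
        """Hardware side: disable the cache and post a soft error."""
        if kind not in CACHE_DISABLING_ERRORS:
            raise ValueError(kind)
        self.enabled = False       # system keeps running, only slower
        self.pending_errors.append(kind)

    def soft_error_isr(self):
        """Software side: log the error, flush the cache, reenable it."""
        self.error_log.extend(self.pending_errors)
        self.pending_errors.clear()
        self.blocks.clear()        # flush discards possibly stale blocks
        self.enabled = True
```

Flushing before reenabling is what makes the scheme safe for a write-through cache: main memory always holds current data, so discarding every block loses nothing.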

Actual System Performance Results 

We were very interested in determining how well 
our simulation results matched real-world opera- 
tion. We decided to focus on several key aspects 
of the system to bound the task of correlating sim- 
ulation with the real world. Specifically, we 
planned to 

■ Confirm that the VAX 6200 CPU performs as 
expected relative to the MicroVAX 3600 sys-
tems. If the cache subsystem behaves as
expected, then the VAX 6200 performance
should exceed that of the MicroVAX 3600 
systems by the clock rate improvement minus 
the penalty for running the CVAX cache in 
I-stream-only mode. 

■ Confirm that the simulation traces adequately 
"stress" the memory interface such that extra- 
polation to real workload performance is 
valid. The percentage of the XMI bus con- 
sumed would be the basis for this comparison. 
This characteristic includes all the effects of 
references per instruction and miss rates and 
ultimately determines the performance of a 
multiprocessor machine. 

■ Confirm that the cache subsystem supports 
very effective utilization of multiple proces- 
sors. VAX 6200 multistream throughput mea- 
surements form the basis of this verification. 

■ Compare the results from the simulation tests 
with similar workloads run on real machines.

Comparing VAX 6200 and 
MicroVAX 3600 Systems
Due to the similarities between the two systems, 
our first approach was to compare the perfor- 
mance of the VAX 6200 to the MicroVAX 3600 
systems by running a set of 100 compute-inten-
sive benchmarks. The VAX 6200 has a 12 percent
cycle time advantage (90 ns to 80 ns), but it is 
somewhat handicapped by the I-stream-only limi- 
tation placed on the internal cache. Recall that 
our performance simulation indicated this 
penalty would average about 5 percent. (See
Table 6.) On average then, we expected the 
VAX 6200 CPU to be about 7 percent faster than 
the MicroVAX 3600 CPU. The compute-intensive 
benchmarks basically confirmed this number; 
VAX 6200 performance averaged 6 percent faster 
than the MicroVAX 3600 CPU. 
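The expected advantage can be checked with a few lines of arithmetic. The sketch below uses only the cycle times and the 5 percent penalty quoted above; the variable names are ours, and both the simple subtraction used in the text and a multiplicative combination are shown because the text does not say which was intended.

```python
microvax_3600_cycle_ns = 90.0
vax_6200_cycle_ns = 80.0
istream_only_penalty = 0.05   # simulated average penalty (Table 6)

# Clock-rate gain: a 90 ns cycle shortened to 80 ns.
clock_gain = microvax_3600_cycle_ns / vax_6200_cycle_ns - 1.0

# The text subtracts the penalty from the gain.
additive_estimate = clock_gain - istream_only_penalty

# Treating the penalty as a slowdown factor gives nearly the same answer.
multiplicative_estimate = (1.0 + clock_gain) * (1.0 - istream_only_penalty) - 1.0

print(f"clock gain:              {clock_gain:.1%}")
print(f"additive estimate:       {additive_estimate:.1%}")
print(f"multiplicative estimate: {multiplicative_estimate:.1%}")
```

Either way the estimate is roughly 7 percent, close to the 6 percent measured on the compute-intensive benchmarks.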

Multiprocessor Bus Bandwidth
Utilization — Real and Simulated 
Workloads 

We have run several forms of multiuser time- 
sharing workloads on the VAX 6200 system. 
These workloads include Digital's standard 
ALL-IN- 1 workload, an order processing bench- 
mark (Compu-Share), an electrical CAD work- 
load, and a software development workload. 9 In 
all cases, the average percentage of the XMI used 
per processor ranged from 3.75 to 5.0. Recall
that our simulation indicated that the percentage
XMI consumed would be 6.27 percent. (See
Table 9.)

Multistream Performance on Compute- 
intensive Benchmarks 
It is beyond the scope of this paper to present the 
multiprocessor simulation data that was gener- 
ated prior to design. That data indicated that the 
VAX 6200 system performance on compute- 
intensive benchmarks would be nearly linear 
when running from one to eight processors. 

Tests to date have confirmed our high expecta-
tions. On compute-intensive workloads, a four-
processor system consistently provides better
than 3.95 times the throughput of the single-
processor system (less than 2 percent degrada-
tion). Limited configuration testing on systems
with up to eight processors indicates that
compute-intensive workloads continue to per-
form very well. An eight-processor system per-
formed at 7.75 times the single-processor system
(less than 5 percent degradation).
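The degradation figures follow directly from the throughput multiples. As a minimal sketch (the function name is hypothetical), scaling efficiency is the measured multiple divided by the processor count, and degradation is its shortfall from ideal linear scaling:

```python
def scaling(processors, throughput_multiple):
    """Return (efficiency, degradation) relative to ideal linear scaling."""
    efficiency = throughput_multiple / processors
    return efficiency, 1.0 - efficiency


for n, multiple in [(4, 3.95), (8, 7.75)]:
    efficiency, degradation = scaling(n, multiple)
    print(f"{n} processors: efficiency {efficiency:.2%}, "
          f"degradation {degradation:.2%}")
```

The four-processor result works out to 1.25 percent degradation and the eight-processor result to about 3.1 percent, consistent with the bounds quoted above.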

Fully Characterized Workloads 
We also instrumented a VAX 6200 system to mea- 
sure a number of processor characteristics, 
including bus utilization. We wanted to deter- 
mine how much the real workload runs varied 
from the simulated runs. The test methodology
was quite simple. 

■ Command files were created that executed a 
single benchmark. These individual bench- 
marks were designed to correspond with the 
simulation traces listed at the beginning of the 
Performance Simulation section. 

■ The Digital Command Language (DCL) com-
mand files were of the following form:

    $
    $ @flushcache            ! initially flush the cache
    $ @starthardwaresample   ! start the measurement hardware
    $ @getcputime            ! get the initial CPU time
    $
    $ run benchmark
    $
    $ @getcputime            ! get the final CPU time
    $ @stophardwaresample    ! stop the measurement hardware
    $

■ The measurement hardware consisted of two
Tektronix DAS 9200 Logic Analyzers; one mon- 
itored the processor bus, and the other was 
attached to the XMI. The start-measurement 
command file simply referenced a specific 
XMI I/O space address on which the DAS 9200 
analyzers would trigger and start taking mea- 
surements. Similarly, the stop-measurement 
command file would reference another XMI 
I/O space address that would cause the logic 
analyzers to stop acquiring data. 

This technique made the measurement process 
simple and repeatable. The overhead of the com- 
mand file was measured by running the command 
file with the "run benchmark" line removed. This
overhead was then subtracted from the results 
obtained from benchmark runs. Run-to-run con- 
sistency was better than ± 1 0 percent. 

The logic analyzers captured the data neces- 
sary to determine the total number of XMI read 
and write references that occurred during the 
execution of the command file. This data was 
used to calculate the total number of XMI cycles 
used by the processor. To derive the percentage 
of the XMI utilized, the total XMI cycles were 
reduced by the command file overhead, and the 
result was divided by the benchmark CPU time. 
This method ensures that the XMI percentage is 
not artificially low due to the inclusion of null 
time elapsed while the processor is waiting for 
I/O activities associated with the benchmark to 
complete. The results are shown in Table 11.
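The utilization calculation can be written out explicitly. The sketch below is an assumption-laden illustration: the function and parameter names are hypothetical, the example counts are invented purely to exercise the formula, and the 64 ns XMI cycle time (used to convert cycles into time) is taken from the XMI description elsewhere in this issue rather than from this section.

```python
XMI_CYCLE_NS = 64   # assumed XMI cycle time


def xmi_utilization(total_cycles, overhead_cycles, benchmark_cpu_seconds):
    """Fraction of the XMI consumed while the benchmark held the CPU.

    total_cycles          -- XMI cycles counted for the whole command file
    overhead_cycles       -- cycles counted with the benchmark line removed
    benchmark_cpu_seconds -- CPU time charged to the benchmark itself
    """
    benchmark_cycles = total_cycles - overhead_cycles
    bus_busy_seconds = benchmark_cycles * XMI_CYCLE_NS * 1e-9
    # Dividing by CPU time (not elapsed time) excludes idle waits for I/O.
    return bus_busy_seconds / benchmark_cpu_seconds


# Invented example: 10,000,000 cycles total, 625,000 of them overhead,
# and 10 seconds of benchmark CPU time.
print(f"{xmi_utilization(10_000_000, 625_000, 10.0):.2%}")
```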

The data indicates that the simulation traces 
required significantly more XMI read bandwidth 
(on average more than double) than the similar 
actual benchmarks. This result is not unexpec- 
ted, since the simulation runs were designed to 
simulate a worst-case timeshare workload. (This 
goal influenced the choice of 35,000 instruc- 
tions for the cache flush interval.) The real 
workloads were run on standalone systems, and 
therefore the cache performance was expected to 
be higher. We are currently studying the effect of 
heavy timesharing in multiprocessor systems on 
cache performance. Initial results indicate that 
our simulation runs are still conservative. 

The results for writes, which are unaffected by 
context switch rates, matched the actual bench- 
marks quite closely. The actual benchmarks 
required about 4 percent to 8 percent more 
bandwidth than the equivalent simulation trace. 
Combined read and write bandwidth require-
ments indicated that the simulated traces used
25 percent more bandwidth than the actual
workloads.

Table 11 Simulated versus Actual XMI Bus
Utilization

           Simulated   Actual      Simulated/
           I-stream    I-stream    Actual Ratio
Average    0.84%       0.32%       2.6
Minimum    0.24%       0.17%       1.4
Maximum    1.65%       0.52%       3.2

           Simulated   Actual      Simulated/
           D-stream    D-stream    Actual Ratio
Average    1.63%       0.74%       2.2
Minimum    1.26%       0.26%       4.8
Maximum    2.10%       1.10%       1.9

           Simulated   Actual      Simulated/
           Writes      Writes      Actual Ratio
Average    4.46%       4.86%       0.92
Minimum    3.57%       3.84%       0.93
Maximum    5.97%       5.75%       1.04

           Simulated   Actual      Simulated/
           Overall     Overall     Actual Ratio
Average    6.09%       4.86%       1.25
Minimum    5.27%       3.84%       1.37
Maximum    7.25%       5.75%       1.26
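The ratio column of Table 11 can be reproduced from the simulated and actual percentages. The short check below uses only the "Average" row of each group as printed; the dictionary layout and names are ours.

```python
# (simulated %, actual %) from the Average rows of Table 11.
table_11_average = {
    "I-stream": (0.84, 0.32),
    "D-stream": (1.63, 0.74),
    "Writes": (4.46, 4.86),
    "Overall": (6.09, 4.86),
}

ratios = {
    name: simulated / actual
    for name, (simulated, actual) in table_11_average.items()
}

for name, ratio in ratios.items():
    print(f"{name:9s} simulated/actual = {ratio:.2f}")
```

The overall ratio of about 1.25 is the source of the "25 percent more bandwidth" figure, while the write ratio just below 1.0 reflects the close match reported for writes.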

Conclusions and Future Work 

The VAX 6200 design experience has demon- 
strated that trace-driven simulation is a power- 
ful tool in the design of a multiprocessor bus 
interface. Because the designers were able to 
make informed trade-off decisions, the design 
met or exceeded all performance goals; and the 
reduced design complexity helped bring the 
system to market on schedule. It is a tribute to 
the team's appropriate control of complexity and 
to the rigorous verification process 1 that the 
first-pass VAX 6200 CPU printed circuit design 
and XMI interface gate array are currently ship-
ping in VAX 6200 systems. At Digital, this level
of success is unprecedented for a machine of this 
complexity.

The continuing trend toward multiprocessing
and faster processors will force increasing depen- 
dence on complex cache subsystems to deliver 
the desired system performance. It follows that 
minimizing the complexity of the cache sub- 
system will help support ever decreasing time- 
to-market schedules. Accurate cache simulation 
techniques will be required to select the imple- 
mentation that meets the performance goals and 
is minimally complex.



The Role of Computer-aided
Engineering in the Design of the 
VAX 6200 System 

The success of the VAX 6200 design is partly attributable to the development 
and implementation of a total verification plan. The goal of this plan was 
to shorten the total system design cycle; the approach was to perform suffi-
cient verification to ensure that first-pass parts would boot and run VMS at
speed. The team responsible for achieving the goal began implementing
the verification process on availability of the first design specification. The 
team's efforts continued concurrent with those of the module design team. 
Milestones for the process reflect the verification team's top-down func- 
tional approach, proceeding from architectural-level verification through 
logic, timing, and system verification, and concluding with vector genera- 
tion. Review and reporting methods established for the project ensured all 
functions were tested and verified. 



This paper presents an overview of the computer- 
aided engineering (CAE) and CAE-based design 
verification test (DVT) approach to the develop- 
ment of the VAX 6200 system. Our intent is not 
to give a step-by-step description; therefore, few 
details of the implementation are given. The 
CAE/DVT Group developers believe that project-
specific problems are generally best solved by
project-specific solutions. Instead, we offer a 
broad overview of CAE which includes the engi- 
neering principles established for the VAX 6200 
project and which we believe will be of use to
those planning a task of similar scope. 

A Brief VAX 6200 System Overview 

No discussion of CAE or DVT methodologies can 
take place without a description of the task to 
which these methods are applied. For our pur- 
poses, the overall task was to engineer, proto- 
type, debug, and release for manufacture the 
VAX 6200 mid-range computer system. 

The VAX 6200 multiprocessor architecture 
is implemented with CMOS technology.1,2 The
system is housed in a 156 by 79 by 76 cm cabi-
net, which contains a system bus backplane, two
6-slot VAXBI backplanes, a TK50 tape drive,
space for future rack-mount devices, power sup-
plies, and blowers.



The heart of the system is a new interconnect 
called the XMI. This interconnect was specifi- 
cally designed to serve as the processor-to- 
memory interconnect in the VAX 6200 system 
and its derivatives. Optimizations of and trade- 
offs in the design of the XMI were made with that 
function foremost in mind. The key features of 
the interconnect are as follows. 

■ The pended bus design allows multiple trans- 
actions to be in progress at the same time; thus 
waste of bandwidth is minimized, for instance, 
during memory read accesses. 

■ The XMI implements the concept of comman- 
der nodes and responder nodes. A commander 
node initiates a bus transaction to which a 
responder node must respond. 

■ The XMI is a centralized arbitration intercon- 
nect. Arbitration logic, resident on the back- 
plane, grants bus mastership according to a 
modified round-robin scheme. There is a
higher priority responder round-robin queue
and a lower priority commander round-robin
queue.

■ Bus width is 64 bits.

■ Cycle time is 64 ns.

■ The XMI supports reads of quadword, octa- 
word, and hexword length, and writes of quad- 
word and octaword length. 

■ Raw bandwidth is 125 megabytes (MB) per 
second.
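The raw bandwidth figure follows from the bus width and cycle time listed above, as this short calculation shows (variable names are ours; MB is taken to mean 10^6 bytes):

```python
bus_width_bits = 64
cycle_time_ns = 64

bytes_per_cycle = bus_width_bits // 8             # 8 bytes per cycle
# bytes per nanosecond times 1000 gives megabytes (10^6 bytes) per second.
raw_mb_per_second = bytes_per_cycle * 1000 / cycle_time_ns

print(f"raw XMI bandwidth: {raw_mb_per_second:.0f} MB per second")
```

This is a peak figure; address and command cycles reduce the bandwidth actually available for data.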

The XMI supports three module types in the 
VAX 6200 system: the CPU module (KA62A), a 
32MB memory array (MS62A), and an XMI-to- 
VAXBI adapter module set (DWMBA). 

The CPU module is based on the CMOS VAX 
(CVAX) chip set, which includes a MicroVAX 
architecture microprocessor (CVAX), a floating 
point accelerator (CFPA) chip, and a system sup- 
port chip (SSC). The module supports full VAX 
capabilities, excepting only PDP-11 compatibil-
ity mode. In addition to a two-way associative
I-stream cache in the CVAX chip, the module
contains a 256 kilobyte (KB) direct-mapped
cache. Performance is approximately 2.8 times
that of a VAX-11/780 processor.

The MS62A is a 32MB memory array module 
with an on-board controller. Modules may be
interleaved up to eight ways to decrease laten- 
cies. Each module has an eight-deep command 
queue. The arrays are fully error-correction code 
(ECC) protected. 

The DWMBA is an adapter module set which 
allows the 6200 system to access I/O devices on 
the VAXBI bus. The DWMBA/A module, which 
resides in a single XMI slot, is connected by cable 
to the DWMBA/B module, which resides in a sin- 
gle VAXBI slot. The DWMBA can support up to 
full VAXBI bandwidth of 13.3MB per second on
write transactions and approximately 5.5MB per 
second on read transactions. 

Figure 1 illustrates how these system elements 
interconnect in a two-processor system with two 
VAXBI channels. 

Because the VAX 6200 system backplane has 
14 slots, many system configurations are possible 
with differing numbers of processors, memory 
modules, and I/O channels. 

In the sections following, we describe the engi-
neering process employed in the design of these
logic elements.

Figure 1 XMI Module Connections on a
VAX 6220 System

CAE Verification Challenges
and Organizational Structure

The overriding goal of any CAE effort is always
the same: to shorten the development time
needed to bring a product to market. The defini-
tion of CAE and the way engineers use CAE to
accomplish this goal differ from project to pro-
ject and even within a single project. Neverthe-
less, two principles are preeminent:

1. CAE should provide the tools, the methods, 
and perhaps most importantly, the discipline 
that together enhance an engineer's productiv- 
ity without unduly restricting his or her cre- 
ativity. 

2. CAE should provide a continual check to 
ensure that the engineer's product meets the 
needs of the project in terms of both function 
and quality. 

The role of the CAE/DVT Group on the 
VAX 6200 project was different from the tradi- 
tional CAE role in one significant respect. The 
group's primary responsibility would not be the 
development of CAE tools and processes. Instead, 
its responsibility was the delivery of first-pass 
hardware that was functional at speed. Explicitly, 
our goal was to ensure that the system would 
boot the operating system (VMS) and run soft- 
ware the first time the system was powered up. 
The only tools and processes developed were 
those specifically necessary to fulfill that goal. 

The project team felt that objective simula- 
tion and verification of the hardware and its 
performance by the CAE/DVT Group would 
(1) enhance the chances of first-pass functional- 
ity, and (2) reduce the overall design cycle by 
paralleling the CAE and the design efforts. Con-
sequently, the CAE engineers were active con-
tributors to the architecture and participated in
choosing alternatives, effecting compromises,
and implementing details of the design. The CAE 
Group was responsible for the correctness and 
quality of the designs and not just for the delivery
of tools to accomplish that correctness. To 
achieve this goal, tasks traditionally performed 
using DVT methods would be accomplished 
using CAE methodology. 

CAE Tasks 

Given the charter described above, the CAE/DVT 
Group outlined the following tasks: 

■ Select a tool suite 

■ Create a process for the CAE effort 

■ Maintain the databases 

■ Construct a CAE environment (models and 
computes) 

■ Generate test cases to run against the environ- 
ment 

■ Isolate and report bugs 

■ Verify the hardware 

■ Generate test vectors for outside vendors 

■ Generate test vectors for manufacturing 

■ Fault grade the test vectors

■ Define exit criteria for committal of design to 
hardware 

■ Enforce compliance with exit criteria

Though the list is long and has some interesting 
tasks, two items constituted the largest portion of 
the work: generation of test cases to run 
against the environment, and verification of
the hardware. 

The generation of test cases is the most time- 
consuming, least glamorous, and most often over- 
looked task; yet the test cases are the single most 
important piece of a superior CAE effort. A suc- 
cessful specification of the test cases (the DVT 
specification) to be run against a CAE environ- 
ment requires a lengthy period of development. 
The development time for the KA62A, MS62A, 
and DWMBA DVT specifications was approxi- 
mately 6 man-months each. Moreover, the speci- 
fication is not static and must be kept current 
with the evolving design. 



The DVT specification must begin as early 
as specification of the hardware functionality 
begins. Working the two specifications in parallel 
ensures functional verification of the system. Fur- 
ther, the DVT specification should be treated 
with the same formality as the hardware specifi- 
cation; that is, it should be reviewed, and all 
reviewers must agree upon its completeness. By 
formalizing the specification review, project 
members are in effect establishing its value to the
project. The DVT specification defines what is to 
be simulated; therefore, superior design tools 
and modeling cannot substitute for the assurance 
of design accuracy that the specification affords. 

As to the verification of the hardware, the 
responsibility of the CAE team was to ensure bug- 
free and operable component, module, and sys- 
tem designs. Team members ran the simulations, 
isolated the bugs, and ensured designs were cor- 
rected by the design team. Simulations were not 
done exclusively by the CAE team, however. The 
environment was available equally to all design 
team members. To the extent that each team felt 
was appropriate, designers initially debugged 
their designs before passing them to the CAE 
team for more formal debug. In this way, obvious 
bugs were found more quickly. Design develop- 
ers did excellent work in this regard and greatly 
eased the burden on the CAE team. 

Further discussion of the VAX 6200 hardware 
verification is presented in the section Verifica- 
tion Milestones. 

Modeling Approach 

Hardware verification done in software is by 
nature a slow process. The major factor con- 
tributing to the slowness of the verification is the 
size of the design. The size is not simply the num- 
ber of logic elements in the design, but the col- 
lective size of the models of each of the elements 
in the logic network. 

We used two types of models for the VAX 6200 
project, behavioral and structural (or gate level). 
Behavioral models, in general, were more 
abstract and efficient in terms of increasing 
overall simulation performance as compared to 
detailed structural models. 

Behavioral models of many of the components 
used in the system were generated early in the 
design cycle. As the design progressed and 
detailed logic schematics became available, how- 
ever, the behavioral models, in most cases, gave 
way to detailed structural models. The exception
was the behavioral models of the CVAX chip set.
These detailed models were used throughout the 
verification process. Given the size and complex- 
ity of these components, simulation with struc- 
tural models, for all practical purposes, was 
impossible. 

In general, our objective throughout the 
verification process was to ensure accuracy and 
not speed. The slow speed of the more accurate 
models was addressed by applying more compute 
power to the task at hand.

CAE Staffing and Resources 
The CAE group was divided into small teams, 
each responsible for the verification of a VAX 
6200 subsystem. The size of the teams varied. 
The KA62A gate array and module team had four 
CAE engineers. Four CAE engineers worked on 
the DWMBA gate arrays and modules. The MS62A 
gate array and module was assigned one CAE engi-
neer. As it turned out, these numbers represented
nearly a one-to-one ratio with the hardware
designers. As senior, experienced engineers, 
team project leaders were responsible for the 
overall coherence of the DVT plan and its quality, 
and were responsible as well for tracking and 
resolving problems. 

Each team included a diagnostic engineer who 
was also working on design verification test. This 
arrangement provided the diagnostic engineers 
early training and also facilitated testing. More-
over, diagnostic engineers were in a position to
easily evolve some of the DVT tests into self-tests 
and ROM-based diagnostics for the VAX 6200 
product.

The educational background of the CAE team 
was a mix of electrical engineers, computer engi- 
neers, and software engineers. Their levels of 
experience varied from new college hires to 
those with 10 or 15 years of work experience. 
The level of relevant hardware experience in this 
group is indicative of the group's tasks, as com- 
pared with other CAE groups that are more 
involved in tools generation. 

Our computer resources consisted of a cluster 
of eight CPUs, including one VAX 8800 system, 
one VAX 8650 system, and six VAX-11/780 sys-
tems. All four modules (KA62A, DWMBA/A,
DWMBA/B, and MS62A) and their associated 
gate arrays were verified throughout most of 
the project on this cluster. During final regres- 
sion testing of each module, in which the full 
set of DVT tests was run against the design,
an additional cluster of eight VAX 8800 systems
was used.

Verification Milestones 

Key milestones were established for the verifica- 
tion team throughout the VAX 6200 design 
verification process. In December 1985, we 
began with the first XMI interconnect verifica- 
tion; we proceeded to performance evaluation, 
logic verification, timing verification, system 
verification, and vector generation. 

These milestones were derived as part of 
our functional top-down verification approach. 
We selected this approach based on our deter-
mination that if a function works correctly, then
all of its component logic must be working 
correctly. 

We therefore chose to model our different 
design objects in the largest reasonable forms and 
then functionally test these models. Every step 
naturally led into the next task of the system
design. This approach was later extended to the 
system as a whole; the system simulation com- 
bined the logic and ran code to exercise the 
entire system. 

Architectural Verification 
At the architectural level, the simulations focused 
on the verification of the new system intercon-
nect, the XMI. As noted earlier, this interconnect,
specifically developed for the VAX 6200 system, 
is a memory interconnect bus with a new arbitra- 
tion scheme and a defined bus interface protocol. 
Both the arbitration and the protocol are imple-
mented in CMOS semicustom technology.

Once the design for the bus protocol and arbi-
tration was established in a specification form, we
immediately transformed the specification into 
high-level behavioral models: the arbiter chip 
model, and an XMI commander transactor model. 
The behavioral arbiter model represented a 
generic, round-robin arbitration scheme; the 
commander model represented a generic XMI 
commander design. The commander model con- 
tained a flexible user interface to allow the 
specification of any desired well- or ill-formed 
transaction to be generated on the bus. Further, 
the commander transactor model was designed to 
selectively self-check for any protocol violations. 

The two models were the basis for all XMI 
design verification. This first level of verification
provided feedback to the architecture team 
quickly and answered questions about the inter-
face protocol and arbitration scheme. As a result,
the arbitration was enhanced and the protocol 
was refined to satisfy the design goals. Specifi- 
cally, a few new signals were added, and the arbi- 
tration was changed from true round-robin to a 
modified round-robin. 

At the next level of architectural verification, 
we modeled an XMI responder node and incorpo- 
rated this model into the simulation environ- 
ment. The team developed a behavioral XMI 
memory model and completed a high-level sys-
tem model. This model was still totally behav-
ioral and represented a system with generic XMI 
commanders and responders. 

Two pieces of test code were generated and 
verified on that model. The environment mod-
eled a fully loaded XMI interconnect. The first
piece of test code was structured in such a way 
that every node on the XMI generated its own 
traffic. Commanders generated all possible com- 
mander sequences, and responders generated all 
possible responder sequences. The goal of this 
first test code was to ensure that the protocol was 
sound. By protocol soundness, we mean that 
commanders and responders can coexist on the 
XMI and can generate traffic sequences without 
loss of data. The results of this verification gave
the team sufficient confidence in the protocol to 
allow the design of the XMI interface compo- 
nents to proceed. 

The second piece of test code was verified on 
the same environment. Every commander gener- 
ated the same sequence of traffic on the XMI. The 
goal of this test was to verify arbitration fairness 
and to guarantee that all XMI nodes got their fair 
share of the XMI. The absence of phenomena
such as lockouts was also verified. 
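A fairness test of this kind can be illustrated with a toy behavioral arbiter. The sketch below is not the XMI arbiter model: the class and method names are hypothetical, and the only properties taken from the text are the two round-robin queues, with responders given priority over commanders.

```python
class XmiArbiterModel:
    """Toy two-level round-robin arbiter (illustrative assumption only)."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.next_responder = 0   # round-robin pointer, responder queue
        self.next_commander = 0   # round-robin pointer, commander queue

    def _pick(self, requests, start):
        # Scan the nodes in round-robin order beginning at 'start'.
        for offset in range(self.num_nodes):
            node = (start + offset) % self.num_nodes
            if node in requests:
                return node
        return None

    def grant(self, commander_requests, responder_requests):
        """Grant the bus for one cycle; responders outrank commanders."""
        node = self._pick(responder_requests, self.next_responder)
        if node is not None:
            self.next_responder = (node + 1) % self.num_nodes
            return node
        node = self._pick(commander_requests, self.next_commander)
        if node is not None:
            self.next_commander = (node + 1) % self.num_nodes
        return node


# Fairness check: four commanders request on every cycle.
arbiter = XmiArbiterModel(num_nodes=4)
grants = {node: 0 for node in range(4)}
for _ in range(400):
    grants[arbiter.grant({0, 1, 2, 3}, set())] += 1
print(grants)
```

With every commander issuing the same traffic, equal grant counts indicate fairness, and a node whose count stays at zero would indicate a lockout.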

This architectural verification proved to be a 
tremendously valuable exercise. First, feedback 
to the architecture team was accomplished 
quickly. Second, this architectural verification 
for the VAX 6200 project established design 
verification tools that can be used for all future 
XMI designs. 

In time, the behavioral model of the arbiter was 
replaced with a structural model derived from 
the chip design database. We enhanced the accu- 
racy of the behavioral models of the XMI com- 
mander and memory by incorporating structural 
models of the XMI interface components once 
their gate-level designs were complete. These
tools are now in use throughout the corporation 
by numerous XMI design teams. 



Clearly, an architectural verification that con- 
centrates on a new bus leaves out many other 
areas of architectural interest. A severe restriction 
of the scope of the VAX 6200 system's architec- 
tural verification was deemed necessary because 
of the lack of schedule time and because of the 
immaturity of the art. Nevertheless, architectural 
verification is a key area where much work should 
be done for the development of the next system. 

Performance Evaluation 
The next verification task was performance
evaluation. Again, work was concentrated into two
well-defined areas, that is, the bandwidth perfor- 
mance of the XMI, and the processor perfor- 
mance in the multiprocessing environment. 

A model of the CVAX processor was obtained 
from the Semiconductor Engineering Group 
design team. We enhanced this model to include 
an XMI interface with a memory port. The stimu- 
lus for this model proved difficult to generate 
because multiprocessing benchmark traces were 
not available. The traffic patterns had to be 
deduced from single stream benchmark traces 
and extrapolated for VAX 6200 symmetric multi- 
processing. 

We ran several benchmarks. We then used the 
results to make decisions about the appropriate 
trade-offs in the area of the processor cache and 
write buffer algorithms. These trade-offs dealt 
specifically with cache and write buffer depth 
versus performance gained. 

Other tools were created to decompose XMI 
traffic into histograms and to generate reports on 
bus bandwidth for the different types of traffic. 
Eventually, XMI memory design latency targets 
were incorporated into the XMI behavioral mem- 
ory model. These system performance simula- 
tions were used to establish such design criteria 
as the memory controller input command queue 
depth and the command queue processing 
algorithm. 
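
A minimal sketch of the kind of traffic decomposition described above follows. The trace format (a list of transaction type and bus cycle count pairs), the transaction names, and the function `summarize_traffic` are assumptions made for illustration; they are not the format of the tools actually used.

```python
from collections import Counter


def summarize_traffic(trace, total_cycles):
    """Return per-type transaction counts and the fraction of bus cycles used."""
    counts = Counter()
    cycles_used = Counter()
    for kind, cycles in trace:
        counts[kind] += 1
        cycles_used[kind] += cycles
    utilization = {kind: used / total_cycles for kind, used in cycles_used.items()}
    return dict(counts), utilization


# Made-up trace: the type names and cycle counts are illustrative only.
trace = [("read", 2), ("write", 3), ("read", 2), ("interrupt", 1)]
counts, utilization = summarize_traffic(trace, total_cycles=16)
```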

Logic Verification 

The next major task was logic verification. The 
main objective of module verification was to 
ensure that the implementation conformed to all 
design goals documented in the system specifica- 
tion. In other words, the goal was not to verify 
what the design was, but what the design was 
supposed to be. 

Members of the CAE team were assigned to 
each design object; each member would work in 
a team with the designers. The verification teams
required complete and coherent specifications
for each design object. These specifications had
to be sufficiently complete to support both
design implementation and logic verification.
Moreover, all functions had to be documented in
a specification. This documentation served two 
purposes: (1) to ensure that the function 
received the proper attention during the verifica- 
tion phase, and (2) to give the responsible CAE 
engineer the information needed to understand 
the functions without referring to logic schemat- 
ics or meeting with the designer. 

With the functional specification as the founda- 
tion, team members generated a verification 
working document for every design object. This 
DVT specification, as mentioned earlier, guided 
the verification work and constituted the primary 
hardware-submittal exit criteria. 

The logic verification was grouped into three 
categories: 

■ Basic functional verification. Basic functional 
tests exercised each function as a standalone 
piece of the design. This testing isolated obvi- 
ous bugs. 

■ Interaction sensitivities. Interaction sensitivity 
exercised the design as a whole, making sure 
that functions could interact with each other 
and could occur in series without cumulative 
fault mechanisms. Testing of function interac- 
tion included any boundary conditions, margin 
testing, and back-pressure on different key 
points in the design. 

■ Error handling. Error handling verification
tested that portion of the design created
specifically for error detection and recovery
mechanisms.

Timing Verification 

Timing verification was performed separately 
upon all key components in the VAX 6200 sys- 
tem. All of this work was performed by applying 
functional patterns to timing models for each of 
the module gate arrays and the XMI arbiter logic.
This work was done using AUTODLY, an internal 
Digital tool. 

The XMI components were tested first. Testing
consisted of applying all possible XMI bus cycles 
against this logic while allowing the timing 
verifier to analyze the logic for any timing paths
with problems. A number of problems were
found and resolved as a result of this testing. 

The timing verification of the module gate 
arrays was performed in a similar fashion. Pat- 
terns of functions were extracted from logic 
verification and then applied to the standalone 
chip timing models. As each pattern was applied, 
the timing verifier would run a complete check 
of the gate array and generate a list of violations. 
These violations would then be checked by the 
designer. If they were valid, logic changes would 
be made. The reason that just the gate arrays were 
verified, and not their complete modules, was 
that each module contained some logic for which 
no structural model existed (for example, the 
CVAX chip set on the KA62A module). The lack 
of a complete module-level timing verification 
model was rectified by requiring the module 
design team to thoroughly analyze its module. 
This approach was possible only because of the 
highly bus-structured nature of our technology.

System Verification 

Once every design object met its exit criteria and 
satisfied the specified testing, the next milestone 
was the start of system simulation. Our task was
to verify the actual design in a system environ- 
ment. We constructed a model consisting of mul- 
tiple processors, memories, and I/O modules. 
This model contained structural representations 
of the actual designs wherever possible. Where 
there were multiples of a design object in the
system simulation environment, one instantiated
copy of the model would be the detailed (and 
slow to simulate) structural model; the other 
instantiations were the faster yet less accurate 
behavioral models. 

In addition to actual design objects in this sys- 
tem model, we included different types of trans- 
actor and traffic generators on both the XMI and 
the VAXBI buses.

The stimulus for this environment had to be 
specific enough to ensure that every type of traffic 
pattern was generated during simulation. The 
stimulus attempted to stimulate every node and 
function concurrently. In a system simulation in 
which the simulation rate is so slow, as much as 
possible must be achieved in every single simu- 
lated clock tick. 

Key to making a system simulation successful
is to start the simulation only after the constituent
pieces of the system have been very thoroughly
verified in isolation. Given the complexity of the
system model and its slow running rate, finding 
simple design bugs at this stage is a waste of 
schedule time. Instead, the model should iden- 
tify the system interaction problems and assure 
developers that the base logic verification was 
thorough. 

Several logic problems dealing with complex
interactions were found during our system
simulation, some after a few microseconds of
simulation. If undetected, these problems would have
seriously impeded progress toward our goal of
providing functional first-pass hardware.

Vector Generation 

The start of system simulation takes place, by 
definition, near the end of the logic verification 
process. At about that time, we began to prepare 
for submittal of the designs for fabrication.
Therefore, in parallel with system simulation,
test pattern generation was started.

Test vectors were needed at this time, primarily 
to test chips coming off the fabrication line. 
Therefore we generated test vectors for the very 
large channel-less arrays contained on each of our 
modules. The basic criterion for approval was 
attainment of 99 percent internal node toggle 
coverage of the gate array logic. In addition to the
99 percent internal node toggle criterion, we also
included the much more stringent criterion of
95 percent stuck-at coverage as measured by a
fault grading mechanism. The methods used to 
determine coverage are discussed in the section 
Problem Reporting and Resolution. 

The vectors were extracted from a strategic sub- 
set of our functional DVT simulation and graded 
on a hardware accelerator/fault evaluator. 

We set a goal that the vector count should not 
exceed the chip's gate count; that is, a chip with 
25K gates should have no more than 25K vectors 
to exercise its logic. The vectoring process, 
including extraction, grading, and complementing,
took an average of one month per gate array.
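
The three numeric goals stated in this section (99 percent toggle coverage, 95 percent stuck-at coverage, and no more vectors than gates) can be collected into a single check. The function name `vectors_acceptable` and its parameters are hypothetical; this is a sketch of the stated goals, not of the grading tools themselves.

```python
def vectors_acceptable(gate_count, vector_count, toggle_coverage, stuck_at_coverage):
    """Check a vector set against the three goals stated in the text."""
    return (
        toggle_coverage >= 0.99          # 99 percent internal node toggle coverage
        and stuck_at_coverage >= 0.95    # 95 percent stuck-at coverage
        and vector_count <= gate_count   # vector count not to exceed gate count
    )


# Example values are made up for a chip with 25K gates.
ok = vectors_acceptable(25_000, 24_000, toggle_coverage=0.992, stuck_at_coverage=0.953)
too_many = vectors_acceptable(25_000, 30_000, toggle_coverage=0.992, stuck_at_coverage=0.953)
```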

As is true of architectural verification, vector 
generation is an area where work remains to be 
done. If we had been able to include some testa- 
bility features in these very dense chips, we 
could have saved this month of schedule time.

Follow-through

Even beyond the prototyping phase, the simulation
database was maintained and updated to
reflect any changes in the design as a result of
hardware debug. The purpose of this on-line soft
representation of the design was twofold. First, 
the representation would aid in the isolation of 
any problems discovered in the lab. Second, the 
database could be used to investigate any suspicious
problem areas that could not easily be triggered
in the hardware.

Review and Reporting Methods 

Throughout the design verification process, a 
means to ensure coverage was established for 
each phase. At the project outset, DVT specification
coverage of functions was assured by
several levels of team review. As the simula- 
tions progressed, the project leaders were given 
the responsibility of ensuring bugs were consis- 
tently reported and corrected. Vector extraction 
and grading of our gate arrays provided a strong 
measure of the completeness of the verifica- 
tion of these chips. Additionally, the internal 
controllers to the gate arrays were measured for 
complete state and product term coverage. 
Lastly, before being released for manufac- 
ture, the design was checked against our own 
exit criteria to ensure that the verification 
was complete. 

This section presents details of these methods 
and tools for ensuring all functions were tested 
and verified.

Functional Coverage 

The VAX 6200 project team chose the functional 
verification approach to verify all VAX 6200 
designs. One problem with this approach is that 
there is no method of measuring functional cover- 
age. Since all verification is based upon the DVT 
specification, functional coverage will be a 
reflection of the completeness of this document. 
Therefore, the DVT specification becomes the 
vehicle by which the functional coverage of the 
verification is to be measured. This specification 
must be made as comprehensive as possible. 
Therefore, the specification underwent many lev- 
els of review by a wide audience, including the 
entire design team.

Problem Reporting and Resolution 
Another means used to ensure coverage was the 
problem-resolution and bug-reporting mecha- 
nism. Every design verification team project 
leader was responsible for tracking bugs in 
the designs and ensuring these bugs were corrected
and the correction was verified. Communication
for this tracking was through VAX NOTES
conferences. 

For each design verification team, two confer- 
ences were created. The first was for bug report- 
ing and bug-fix resolution. Only verification team 
members could write notes in the bug confer- 
ence. Every entry indicated the date, model revi- 
sion levels, test case number, failure symptom, 
and any assessment of the problem. Replies to 
each entry were entered, either by the project 
leader or the CAE team member responsible for 
the failing test, to indicate when the bug was 
verified as being fixed and the model revision lev- 
els at the time of verification. If the problem 
remained unresolved, the reply would indicate 
any action taken or patches made. 

The NOTES conference review ensured that 
all bugs were given the proper attention and 
visibility. 

The second conference was informational. 
Using this conference, engineers could learn 
about key aspects of the design as the verification 
progressed. For example, they could obtain
information on undocumented features on which
certain verification tests were based.

Fault Grading 

Another process, which was implemented to 
measure functional coverage of the component 
patterns, was the fault-grading mechanism. In 
this approach, all component patterns for the 
large compacted arrays were generated at the 
functional level. The simulation environment for 
pattern capture was the same one used for func- 
tional verification. The stimulus generated was 
driven by high-level functions. The test patterns 
were captured at the chip's boundaries while the
chip was being exercised on the module. 

Traditionally, component patterns are gen- 
erated by simulating the chip standalone and 
driving hand-crafted stimulus through the chip 
simulation. Due to test overlap, the approach 
taken by the VAX 6200 team did not ensure the 
optimum number of patterns for the maximum 
stuck-at coverage. However, the approach proved 
to be very beneficial. Ranging from 20K to 50K 
patterns for each gate array, the patterns were 
generated in the very short time of approximately 
one month. In reaching our goal of 95 percent 
fault coverage with these test patterns, additional
areas of logic were found that had not previously
been tested. This additional logic yielded addi- 
tional bugs. 

The fault grading process also provided an 
additional degree of confidence in the coverage 
of the functional verification test cases. The 
95 percent fault coverage goal was achieved with 
patterns derived from a subset of those test cases. 
It should be mentioned that the hardware fault 
evaluator was used extensively during this phase
of the project and proved to be an irreplaceable 
tool. 
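
To make the notion of stuck-at coverage concrete, the following toy example grades a pattern set against a three-input circuit, y = (a AND b) OR c. The circuit, the node names, and the functions `simulate` and `stuck_at_coverage` are invented for illustration; the hardware fault evaluator used on the project worked on full gate arrays and is not modeled here.

```python
from itertools import product

NODES = ["a", "b", "c", "n1", "y"]  # n1 is the internal AND output


def simulate(a, b, c, fault=None):
    """Evaluate y = (a AND b) OR c, optionally forcing one node to a stuck value."""
    def force(name, value):
        if fault is not None and fault[0] == name:
            return fault[1]
        return value

    a, b, c = force("a", a), force("b", b), force("c", c)
    n1 = force("n1", a & b)
    return force("y", n1 | c)


def stuck_at_coverage(patterns):
    """Fraction of single stuck-at faults detected by at least one pattern."""
    faults = [(node, value) for node in NODES for value in (0, 1)]
    detected = 0
    for fault in faults:
        # A fault is detected when some pattern makes the faulty output differ.
        if any(simulate(*p) != simulate(*p, fault=fault) for p in patterns):
            detected += 1
    return detected / len(faults)


single = stuck_at_coverage([(1, 1, 0)])
exhaustive = stuck_at_coverage(list(product((0, 1), repeat=3)))
```

A single functional pattern detects only some of the faults, which is why grading a functionally derived pattern set can expose logic that the tests never exercised.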

State Machine Coverage 
Tools were developed that would analyze traces 
generated from the internal gate-array controllers 
and sequencers. Traces were collected while the 
functional tests were being simulated and 
verified. All traces were later analyzed, and cover- 
age was ensured for every state and product term. 
This mechanism was put in place and automated, 
so that after each regression, coverage could be 
rechecked. 

After every regression run of all test cases, the 
results were analyzed to ensure that no product 
terms or states were missed as a result of test 
modification or bug fix. Additional test cases were 
generated to find specific and hard-to-activate 
conditions. 
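
A coverage check of this kind can be sketched as a set difference between what a controller defines and what the traces show. The trace format (state and product term pairs), the controller contents, and the function `coverage_gaps` are assumptions made for this example only.

```python
def coverage_gaps(all_states, all_terms, traces):
    """Return the states and product terms that no trace ever exercised."""
    seen_states = set()
    seen_terms = set()
    for trace in traces:
        for state, term in trace:
            seen_states.add(state)
            seen_terms.add(term)
    return set(all_states) - seen_states, set(all_terms) - seen_terms


# Made-up controller definition and regression traces.
states = ["IDLE", "REQUEST", "GRANT", "ERROR"]
terms = ["p0", "p1", "p2", "p3"]
traces = [
    [("IDLE", "p0"), ("REQUEST", "p1"), ("GRANT", "p2")],
    [("IDLE", "p0"), ("REQUEST", "p1")],
]
missed_states, missed_terms = coverage_gaps(states, terms, traces)
```

Rerunning such a check after every regression flags any state or product term that a test modification or bug fix has stopped reaching.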

Exit Criteria 

Before a design is sent to manufacturing, the 
design must meet the exit criteria. These criteria 
are as follows: 

■ All the specified test cases have been gener- 
ated and have run bug-free against the latest 
design. 

■ The system simulation has run bug-free for two 
continuous weeks. 

In other words, if bugs still exist in the design, 
the design is not yet ready for manufacture. 

As judged by the nearly bug-free condition 
of the implemented hardware, these design- 
completion criteria and coverage metrics were 
appropriate for the VAX 6200 development 
effort. 

The VAX 6200 project's tremendous success 
has established the process for future systems 
verification and for engineering quality measure- 
ment. 
Results Attained

The development cycle for the VAX 6200 system 
was quite short, and therefore the need to pro- 
duce functional first-pass hardware was very 
strong. The first XMI specification was released in 
December 1985. Eight months later, all of the
XMI parts had been designed and simulated and
were being manufactured. Two months later, the
parts were up and running.

During this time, specifications were released 
for the KA62A, MS62A, and DWMBA, and logic
design was begun. Concurrently, test specification
and test generation also began. In the late
summer of 1986, all logic design was completed,
and verification began. Two to three months after
design completion, verification was completed 
for each module. With a complete and verified 
design, one month was used to generate all gate- 
array test vectors and then submit the gate arrays 
for manufacture. 

In February 1987 — 14 months after the first 
complete XMI specification — the DWMBA was 
manufactured, powered on, and run with first-pass
hardware. One month later, the KA62A was
powered on and running. Two weeks later, with
functional MS62As, the first VAX 6200 system was
powered on. Two weeks after that, on April 1, the
first VAX 6210 system booted VMS with all first-pass
functional parts.

Although a few bugs were later to be found and 
fixed, the goal of using simulation to generate 
hardware that works at speed the first time was 
attained. In fact, many of those original parts are 
being shipped with the VAX 6200 systems today. 

Opportunities for Improvement 

Although our verification process proved to be 
quite successful, we plan to make a few changes 
in this process for future projects. 

Architectural verification, insofar as that
means an effort to discover system-level inade- 
quacies or bottlenecks, is in its infancy. We con- 
sider this a wide open area where much can be 
accomplished. 

As module designs call for increases in speed, 
timing verification and signal integrity verifica- 
tion will make a much larger contribution to the 
total verification effort. Although the XMI inter- 
connect was verified to all circuit, signal, and 
timing specifications, signal integrity was not 
emphasized to the same degree in the modules 
themselves. Although no significant problems
arose, we became strongly aware that future
generations of hardware will be much more
dependent on the type of verification used for
the XMI. Timing verification was performed on
all gate arrays on the VAX 6200 system, but in
the future we feel it is important to perform
timing verification on the design during early
development. Thus we can
identify and solve the timing problems before 
they become too entrenched in the design to be 
fixed easily. 

Since the wire delays for gate arrays can only be 
estimated until gate layout has taken place, all 
verification must be repeated once the actual tim- 
ing numbers are returned. Additionally, floor 
planning of the gate array can have a significant 
effect on the performance and specific wire 
delays. On the VAX 6200 project, the layout and 
final wire delay calculations were performed by 
our gate array vendor and then sent back to us for 
reverification. These steps can take quite a long 
time in the design cycle of a gate array. To reduce 
the wait for real wire delays, we plan to perform 
all floor planning and preliminary layout opera- 
tions at the design site. Additionally, this will 
allow us much more input to the floor plan and 
layout. 

Summary 

The success of the VAX 6200 verification effort 
can be attributed mainly to the decision to begin 
verification at the same time as the design and to 
continue verification and design as parallel 
efforts. This decision was implemented by assembling
verification teams at the same time design
teams were being built. 

Verification was performed during each stage 
of development — from initial concept to system 
integration. The architectural verification confirmed
the XMI architecture and arbitration
algorithms. Performance verification helped 
define the processor and memory architectures 
and ensured that these architectures could take 
full advantage of the new XMI. The logic of all
XMI modules, their gate arrays, and the XMI arbi- 
tration logic was verified against their speci- 
fications, not against the designs themselves. 
Lastly, the entire VAX 6200 system was simulated 
in a multiprocessing environment, proving that 
the different component modules could function 
together as a system. Verification from system 
architecture to gate arrays, modules, and then 
back to a complete system again, throughout the
life of the project was the only way to assure the 
main verification goal — first-pass, functional 
hardware. 

During logic verification, attempts were made 
to perform verification using the smallest detail, 
while still keeping the scope of the logic under 
test large enough to allow system-level testing. 
By performing all testing at these much higher 
levels, a greater number of functions and more 
global functions can be tested at one time. The 
only drawback to testing at this level is simulation
speed. The trade-off of speed for accuracy is
a good one, for without accuracy the costly alter- 
native is to design and manufacture multiple 
passes of hardware. 

In conclusion, the most important outcome of 
our verification effort was a management philosophy
that, in the end, verification is as important as
logic design. With this understanding, verification
criteria now determine when and whether
designs are to be released for manufacture. To
make this work successfully, the necessary 
resources must be allocated for the verification 
effort. Furthermore, project teams must develop 
and follow through with complete verification 
strategies. These strategies focus on verification
as a part of the total design process rather than as 
a process that takes place after designs are com- 
plete. The VAX 6200 project was proof that this 
philosophy can be made to work. 
VMS Symmetric Multiprocessing

The symmetric multiprocessing features of VMS version 5.0 effectively
utilize the greater computing power of Digital's multiple CPU systems. 
Key to the SMP design is an innovative mechanism, called a spinlock, that 
provides a high degree of parallelism for kernel-mode code. Where formerly
VMS software used interrupt priority levels (IPLs) to synchronize
processes, VMS now uses spinlocks. Because each VMS resource can be
protected by a spinlock, this design provides more synchronization
levels than could IPLs alone. Spinlock granularity directly affects system
performance. 



This paper describes the major features of symmetrical
multiprocessing (SMP) in the VAX/VMS
operating system. These enhancements are
included in VAX/VMS version 5.0. Although it is
impossible to present details of every aspect of 
the SMP design in these few pages, this paper 
provides an overview of the key mechanisms 
developed for VMS SMP. 

Technology Developments 

Over the last several years advances in computer 
technology, especially in VLSI, have yielded 
greater computing power in increasingly smaller 
packages. VLSI CPU chips have made possible
multi-CPU, single-board computers. These multiple
CPU systems are having an increasing impact
on the general-purpose computing environment. 
The net result is that recent technology trends 
have redirected the challenge of building multi- 
processing systems from the hardware engineers 
to the systems software engineers. Systems software
engineers must now design effective ways to
utilize systems with six, eight, or even more 
CPUs. 

VAX Hardware Features Required by 
the VMS Operating System 
The VMS SMP design requires that certain fun- 
damental features be implemented in VAX multi- 
processing hardware. These features are as 
follows: 

■ The ability to share common memory among 
all CPUs in the system 

This shared memory allows all CPUs to execute
a single copy of the operating system and to
share state information that provides load balancing
capabilities.

■ An interprocessor interrupt capability that 
enables one CPU to interrupt all other CPUs or 
a single CPU 

■ The set of interlocked instructions (BBSSI, 
BBCCI, ADAWI, INSQxI, and REMQxI), which
are part of the VAX architecture and thus 
present in every VAX system 

■ Cache coherency maintained by the hardware, 
without software assistance 

■ One CPU, known as the primary CPU, that
must have access to all I/O, console subsystem,
and timekeeping hardware

With these hardware features, VMS can provide
symmetric multiprocessing support for any VAX
system. All code executing in user, supervisor, or
executive mode can execute on any CPU without
restriction. Most (if not all) kernel-mode code
can execute on any CPU without restriction. The
only restricted code is that small amount of
kernel-mode code that requires access to the
time-of-day internal processor register or to the
console terminal and the console block storage
device.

The SMP design has no requirement regarding 
the system topology or interconnect joining the 
multiple processors. It supports systems implemented
by means of a single bus architecture,
such as the VAXBI bus, as easily as systems that
use a cross-bar connection. 

Therefore, the VMS SMP design is flexible 
enough to support current VAX systems and 
future VAX systems that take advantage of advancing
technologies and architectures.

New Multiprocessing Hardware 
The design of recent VAX systems, such as the 
VAX 8800 and the VAX 6200 series of computers, 
offers an elegantly simple, symmetric hardware 
configuration. Central to the design of these sys- 
tems are two new bus architectures — the XMI 
bus and the VAXBI bus (Figure 1). The VAXBI 
architecture provides a protocol that allows (1)
multiple processors to issue device requests, and
(2) operating system software to specify which
processors a device controller will interrupt.

The symmetry of these I/O subsystems pre- 
sented a new challenge to the VMS SMP design- 
ers: to provide an I/O database design that would 
make possible simultaneous execution of inter- 
rupt handlers, thus taking advantage of these new 
hardware features. 

The Development of VMS SMP 

Critical to SMP was a new method, used through- 
out the VMS kernel, to synchronize multiple pro- 
cessors. One possible SMP design would have 
been to create a single lock for kernel-mode
operations and allow any processor to acquire 
that lock. However, the VMS engineers believed 
that such a design would not have provided 
sufficient parallelism to achieve good system 
throughput for systems with more than a few processors.
This single-lock method would have
been a nonscalable solution; if more CPUs were
added to the system, system performance would 
not increase due to blocking for the single lock. 

A more ambitious yet costly design was to 
provide a high degree of parallelism for kernel- 
mode code. With this kind of parallelism, many 
processors are allowed to execute different 
portions of the executive at the same time. For 
example, a process adding a system-wide logical 
name should be able to execute on one CPU 
while another CPU handles a device interrupt for 
completion of a disk I/O request, etc. This design
would require creation of numerous locks and 
careful design of the interactions between 
the critical regions that use those locks. This 
design approach was the one finally chosen by 
the VMS engineers, and is discussed in the follow- 
ing sections. 

Synchronization in VMS: 
Raising IPL, Mutexes, and Spinlocks
The original VMS version 1.0 design used two
types of synchronization: (1) raising interrupt 
priority level (IPL) and (2) mutual exclusion 
semaphores (mutexes). The VAX architecture 
provides 31 IPLs; 1 through 15 are dedicated 
for use by software, and 16 through 31 are 
reserved for hardware (IPL 0 is not really an 
IPL but rather the level at which user, supervisor, 
and executive mode programs execute). VMS
blocked different types of system events by
raising IPL to or above the level at which
that event occurred. For example, process
rescheduling was done by means of an IPL 3
software interrupt. Code threads that modified a
process's context always executed at IPL 3 (or
higher) to prevent a reschedule. Another example
is the manipulation of device controller registers.
These registers were always manipulated at
the device's hardware interrupt level; thus other
system activity of a lesser importance was
blocked out while the time-critical code path
was executed.

The second synchronization method, mutexes, 
was used to lock purely software constructs, such 
as global section descriptors. Mutexes provided a
mechanism for defining many locks without 
assigning a unique software IPL to each lock. A 
mutex was acquired by the operating system on 
behalf of a process and was considered "owned" 
by that process. Rescheduling could occur while 
a process "owned" a mutex; however, process 
deletion could not occur. Lock requests made by 
a process of higher priority for an already owned
mutex were handled by placing the requesting 
process into a wait state, thus avoiding dead- 
locks. 

In a multiprocessing system, each VAX CPU 
has its own interrupt priority level, independent 
of the others. Thus raising IPL would synchronize 
on a single CPU but not across the entire system. 
IPLs, then, could not be used to synchronize all 
CPUs. Neither were mutexes a viable solution, 
since they could only be used within process 
context and at low IPLs. Therefore, the SMP team 
created a new VMS mechanism that they termed a 
"spinlock." Anywhere VMS code had previously 
synchronized by raising IPL, the code would now 
acquire a spinlock; wherever VMS code had lowered
IPL, it would now release a spinlock. Use of
mutexes remained unchanged save that the code 
to acquire and release mutexes was protected by 
a spinlock. 

The design for spinlocks included a number of
critical concepts. First, a spinlock is "owned" by
a CPU, not by a process (as mutexes are). Second,
each spinlock is acquired and released at a particular
IPL that is associated with the spinlock.
Raising IPL when a spinlock is acquired prevents
other activities from interrupting time-critical
code. Third, CPUs "spin-wait" when blocked
from obtaining a spinlock resource held by
another CPU, since spinlocks are only assigned to
time-critical resources that cannot be locked for
long periods of time. Lastly, the design of spinlocks
includes a mechanism for deadlock prevention
or detection, since the debugging of "hung"
systems is too costly. Each spinlock is therefore
assigned a rank; because spinlocks must be
acquired in order of rank, deadlocks are
prevented. Further, a debugging aid was built into
the spinlock design. A part of each spinlock data
structure is set aside to hold the last eight program
counters (PCs) that acquired or released
each spinlock. When enabled, these consistency
checks proved invaluable in determining interactions
between different components in the VMS
executive, such as memory management and
scheduling.
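
The rank rule and the eight-entry history can be illustrated with a small model. This is only a sketch: it assumes that a CPU may acquire a spinlock only if its rank is higher than that of every spinlock the CPU already holds (the direction of the ordering is an assumption), it records a caller label in place of a real program counter, and it does not model the locking itself. The names `RankedSpinlock`, `Cpu`, `LOCK_A`, and `LOCK_B` are hypothetical.

```python
from collections import deque


class RankedSpinlock:
    def __init__(self, name, rank):
        self.name = name
        self.rank = rank
        # Keeps only the last eight acquire/release call sites.
        self.history = deque(maxlen=8)


class Cpu:
    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        self.held = []  # spinlocks this CPU currently owns

    def acquire(self, lock, caller):
        # Assumed rule: ranks must strictly increase while locks are held.
        highest = max((held.rank for held in self.held), default=None)
        if highest is not None and lock.rank <= highest:
            raise RuntimeError(
                f"CPU {self.cpu_id}: {lock.name} acquired out of rank order"
            )
        self.held.append(lock)
        lock.history.append(("acquire", self.cpu_id, caller))

    def release(self, lock, caller):
        self.held.remove(lock)
        lock.history.append(("release", self.cpu_id, caller))


lock_a = RankedSpinlock("LOCK_A", rank=1)
lock_b = RankedSpinlock("LOCK_B", rank=2)
cpu0 = Cpu(0)
cpu0.acquire(lock_a, "routine_x")
cpu0.acquire(lock_b, "routine_x")   # allowed: rank 2 after rank 1
cpu0.release(lock_b, "routine_x")
cpu0.release(lock_a, "routine_x")
```

Because every CPU must take the locks in the same rank order, no two CPUs can each hold a lock that the other is waiting for, which is the circular wait that a deadlock requires.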

The VMS engineers implemented routines for 
acquiring and releasing spinlocks rather than 
scatter in-line code through the VMS kernel. The 
first step in acquiring a spinlock is to synchronize 
the local processor by raising to the IPL of the 
spinlock, just as if it were a uniprocessor system. 
The actual locking of a spinlock is accomplished
with an interlocked test-and-set memory operation,
the BBSSI (Branch on Bit Set and Set Interlocked)
instruction. The spinlock interlock bit is
contained in a separate byte within the spinlock
structure. Unlocking a spinlock is done with the
inverse BBCCI (Branch on Bit Clear and Clear 
Interlocked) instruction. These interlocked oper- 
ations are atomic memory transactions across all 
processors in a VAX multiprocessor configura- 
tion. Furthermore, since memory is common to 
all processors, the interlocked memory test-and- 
set operations provide a sufficient method of 
extending synchronization to all processors 
within a multiprocessor system.
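
The two-step acquire sequence (raise IPL locally, then spin on an interlocked test-and-set) can be sketched in Python. This is an illustrative model, not the VMS routines: threads stand in for CPUs, a `threading.Lock` stands in for the hardware memory interlock, and all class and function names are assumptions. Using `max` so that IPL is never lowered on acquire is also an assumption.

```python
import threading


class InterlockedBit:
    """Stand-in for a memory bit updated by interlocked instructions."""

    def __init__(self):
        self._bit = False
        self._bus = threading.Lock()     # models the hardware memory interlock

    def test_and_set(self):
        """Set the bit and return its previous value (BBSSI-like)."""
        with self._bus:
            old, self._bit = self._bit, True
            return old

    def test_and_clear(self):
        """Clear the bit and return its previous value (BBCCI-like)."""
        with self._bus:
            old, self._bit = self._bit, False
            return old


class Cpu:
    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        self.ipl = 0


class Spinlock:
    def __init__(self, ipl):
        self.ipl = ipl
        self.bit = InterlockedBit()

    def acquire(self, cpu):
        saved_ipl = cpu.ipl
        cpu.ipl = max(cpu.ipl, self.ipl)  # step 1: synchronize the local CPU
        while self.bit.test_and_set():    # step 2: spin until the bit was clear
            pass
        return saved_ipl

    def release(self, cpu, saved_ipl):
        self.bit.test_and_clear()
        cpu.ipl = saved_ipl               # restore the caller's previous IPL


def run_demo(n_cpus=4, iterations=200):
    """Have several 'CPUs' increment a shared counter under one spinlock."""
    lock = Spinlock(ipl=8)
    shared = {"count": 0}

    def worker(cpu):
        for _ in range(iterations):
            saved = lock.acquire(cpu)
            shared["count"] += 1          # critical region
            lock.release(cpu, saved)

    threads = [threading.Thread(target=worker, args=(Cpu(i),))
               for i in range(n_cpus)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["count"]
```

If the lock works, `run_demo(4, 200)` returns exactly 800, since no increment can be lost inside the critical region.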

The use of multiple IPLs as a synchronization 
method in VMS provides the capability to schedule
events in a prioritized fashion. The inclusion
of IPLs in the spinlock structure allows the SMP 
synchronization mechanism to appear as an 
added dimension to IPLs. Moreover, this SMP 
mechanism preserves the ability to schedule 
events in a prioritized manner. 

For uniprocessor systems, the SMP design also 
includes the ability to optimize the routines that 
acquire and release spinlocks. For example, on a 
single CPU system, the spinlock acquire-and- 
release routines are never called. Instead, only a 
move-to-processor register (MTPR) instruction is 
executed, thus raising IPL. System performance 
of a single CPU has been measured as only a tiny 
percentage less than VMS version 4 performance.

Mutex synchronization is still the second
synchronization method used in VMS. In the SMP
design, mutexes are used for locks that are held
for long periods of time and for situations in
which the IPL has to be lowered. Mutexes are still
owned by processes, not by CPUs, under the SMP 
design. 

Spinlock Granularity, Devicelocks 
One aspect of the SMP design that directly affects
system performance is the granularity of the
spinlocks. A coarse granularity (fewer spinlocks) is
easy to implement and debug; however, a coarse
granularity provides fewer synchronization
points, and thus processors are blocked for
longer periods. A finer granularity (more
spinlocks) provides more parallelism and thus
shorter blocking times; however, a fine granularity
is much more complicated to design and
implement, and requires more synchronization
points. An important concept to remember is
that, while the system is in a noncontending
situation, a synchronization point only adds
unnecessary overhead. That is, if there is never any
possibility of processors contending for the same 
resource, then synchronization is not required. 
Therefore, the SMP team decided that a manage- 
able number of spinlocks for the initial design 
was no more than 32. The SMP design provides 
designers the ability to create a finer granularity 
of locks in future releases of VMS as performance 
measurements identify time-critical resources. 

As the SMP development evolved, it became 
clear that a finer granularity of spinlocks for 
the I/O subsystem would be easy to implement. 
With multiple VAXBI buses, multiple CPUs could 
handle different device interrupts simulta- 
neously. This further improved the parallelism of 
the system and resulted in a new characteristic 
for spinlocks: dynamic versus static spinlocks. A 
static spinlock protects those resources common 
to all VAX/VMS systems. Therefore, static spin- 
locks are assembled into the VMS source code. 
Dynamic spinlocks synchronize device-specific 
code and so are created at boot time, depending 
upon the I/O configuration of the particular 
VAX system. Thus the number of dynamic spin- 
locks varies from system to system, whereas the 
number of static spinlocks is consistent across 
all systems. The dynamic spinlocks used to lock 
particular devices were named "devicelocks" to 
differentiate them from static spinlocks. A 
devicelock is used wherever device-specific code 
previously raised IPL to a device's IPL to block 
interrupts. 

Identifying Resources Requiring 
Spinlocks 

One of the first SMP development tasks was to 
identify each VMS resource that needed syn- 
chronization and then determine the proper lock- 
ing mechanism — spinlock, mutex, interlocked 
queue, etc. Once this work was complete, the 
added dimension provided by spinlocks allowed 
multiple resources to be protected by a single 
IPL. For example, IPL 8 (SYNCH) had protected
the following resources: memory management, 
scheduling, the I/O database, the file system, and 
the timer queue. By adding a new dimension, 
namely spinlocks, each of these resources could 
be protected by a different spinlock but share the 
same IPL. Therefore, in a multiprocessor configu- 
ration, it was now possible to run more than one 
processor at the same IPL. However, the processors
must be executing different critical regions
of code. The spinlock design, therefore, has the
advantage of providing more synchronization lev- 
els than could be provided by IPLs alone. Hence, 
the granularity of spinlocks can be much finer 
than that allowed by software IPLs alone. This 
finer granularity in turn provides more concurrency
of execution in the VMS kernel.

For example, IPL SYNCH had protected a large 
number of resources and thus would be a good 
candidate for a finer granularity of spinlocks.
Where VMS code had previously raised IPL to 
SYNCH, the SMP team had to determine which 
spinlocks had to be acquired and then perform 
the conversion. 

In summary, IPL SYNCH became the following 
spinlocks: 

FILSYS    File system structures (such as file control blocks)

IOLOCK8   Fork IPL 8 (map registers, data paths, and System
          Communication Services resources)

TIMER     Timer queue

MMG       Memory management, page description database,
          swapper, and modified page writer

JIB       Portions of the job information block

SCHED     Process control blocks, scheduling database,
          acquisition/release of mutexes

Per-CPU Context Areas and
Interrupt Stacks 

Another development task was to identify the 
context that had to be maintained for each pro- 
cessor — independent of the general system 
structures. This "per-CPU" context area had to 
include such items as identification of the current 
process, a unique CPU identification field, and 
CPU-specific work queues. In addition, design
requirements specified that a processor be able 
to locate its private CPU context area with mini- 
mal overhead. 

The easiest solution would have been to 
include an internal processor register (IPR) into 
which software could load the virtual address of 
the context area. Since IPRs are part of the pro- 
cessor hardware, each CPU could have pointed to 
its own context area without confusion. How- 
ever, such a processor register did not exist in the 
VAX architecture. Therefore another solution was
needed in order for SMP to execute on existing
VAX systems.

A creative alternative to inventing a new IPR
was to find a method to use an existing IPR
for multiple purposes. The VAX architecture
includes an interrupt stack pointer (ISP), which
software loads with the virtual address of the
interrupt stack. Since each processor must have
its own stack for handling interrupts, this area
was already CPU-specific. Under the SMP design,
the interrupt stack area and the CPU context area
are treated as one virtually contiguous context
block. When the virtual address of this new context
area is rounded to an appropriate power of
two, a simple clearing of the low order bits of the 
virtual address of the ISP yields the base address 
of the private CPU context area. 

This solution provided two similar ways to find 
the private CPU context area: 

    MFPR   #PR$_ISP,Rx
    BICL   #mask,Rx

or

    BICL3  #mask,SP,Rx    (when running on the interrupt stack)

Both methods return the virtual address of the 
private CPU context area. However, the latter 
case provides the faster mechanism. 
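
The masking step can be worked through numerically in Python. This is a sketch only: the 16 KB block size and the sample stack pointer values are hypothetical, chosen to show how clearing the low-order bits of any address inside the block yields the block's base.

```python
CONTEXT_BLOCK_SIZE = 0x4000  # hypothetical size; must be a power of two


def per_cpu_base(stack_pointer, block_size=CONTEXT_BLOCK_SIZE):
    """Clear the low-order bits of a stack pointer to find the block base."""
    if block_size <= 0 or block_size & (block_size - 1):
        raise ValueError("block_size must be a power of two")
    mask = block_size - 1                 # the bits that the BICL would clear
    return stack_pointer & ~mask & 0xFFFFFFFF


if __name__ == "__main__":
    for sp in (0x80123A5C, 0x80127FFC):
        print(hex(sp), "->", hex(per_cpu_base(sp)))
```

The technique works only because the combined block is aligned on a boundary equal to its (power-of-two) size, so every address inside it shares the same high-order bits.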

Translation Buffer Invalidation — 
A Form of Cache Coherency 
As was already mentioned, the VMS SMP design 
required that cache coherency be maintained in 
the hardware. However, the VAX architecture 
includes one hardware cache that is maintained 
by software, the translation buffer. The transla- 
tion buffer caches page table entries (PTEs) to 
speed up address translation from virtual to phys- 
ical memory addresses. 

Software monitoring of the translation buffer is 
appropriate for two reasons. Since page table
pages are only "virtually contiguous" and not 
"physically contiguous" portions of VAX main 
memory, monitoring changes to the PTEs would 
be difficult for hardware. Also, since modification 
of page table contents is usually an infrequent 
event, this cache is more suitably maintained by 
the software. 

Therefore, as part of its monitoring function, 
the operating system software must notify the 
processor whenever it changes the contents of a
PTE, in case the PTE is cached in the translation
buffer. This notification is called a translation 
buffer invalidation request and is accomplished 
by a write to an IPR. Since PTEs can be cached on 
any processor in a multiprocessor system, one 
possible implementation would be for all CPUs 
to perform a translation buffer invalidation 
request when any PTE is changed. Since transla- 
tion buffer invalidation must be carefully coordi- 
nated among all CPUs, however, this simple 
approach would have significantly affected sys- 
tem performance if left unmodified. 

Two other features of VAX/VMS memory man- 
agement play significant roles in the design for 
translation buffer invalidation in the SMP envi- 
ronment. First, a user-process address space can- 
not be executing on multiple processors simulta- 
neously. Second, the cached user-process PTEs 
are invalidated when a LDPCTX (load process 
context) instruction is executed as part of pro- 
cess rescheduling. 

Using these features, engineers optimized the 
design to require system-wide translation buffer 
invalidation only for system address space and 
not for user address space. Since system addresses 
change less frequently than user space addresses, 
this new design allowed for a major reduction in 
the interprocessor communication traffic.
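
The resulting policy can be summarized in a few lines of Python. This is a sketch of the decision only, not of the VMS code: the function names are assumptions, and the test for system space relies on the VAX convention that system addresses occupy the upper half of the 32-bit virtual address space.

```python
SYSTEM_SPACE_BASE = 0x80000000  # VAX system space starts at this address


def is_system_address(virtual_address):
    return virtual_address >= SYSTEM_SPACE_BASE


def cpus_needing_invalidate(virtual_address, current_cpu, active_cpus):
    """Return the set of CPUs that must invalidate their cached PTE."""
    if is_system_address(virtual_address):
        # System space is shared, so any CPU may have the PTE cached.
        return set(active_cpus)
    # A user address space runs on one CPU at a time, and its cached PTEs
    # are flushed when another process context is loaded.
    return {current_cpu}
```

Only the first branch requires interprocessor communication, which is where the traffic reduction comes from.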

Process Affinity 

Certain operations in a multiprocessor system 
must execute on particular CPUs. The VMS SMP
designers termed the binding of a process to a 
particular CPU as "process affinity." Affinity for a 
process is implemented by means of a 32-bit 
mask (one bit per CPU) in the process control 
block (PCB). Once a process is assigned affinity, 
the process may only execute on CPUs for which 
it has affinity. Process affinity is enforced by the 
VMS scheduler during a reschedule event. (Note
that only for real-time priority processes does 
VMS SMP guarantee to run the N-highest priority 
processes on an N-processor system.) 

The VMS SMP design currently implements two 
levels of process affinity: hard affinity and capa- 
bilities. Hard affinity forces selection of a single 
CPU in the affinity mask. This level of affinity is 
used when a process must be guaranteed execu- 
tion on a particular CPU, which is specified 
by the CPU identification field in the PCB. 
Specifically, hard affinity is used to implement 
CPU diagnostics and to halt a CPU. When hard
affinity is being enforced, the process affinity
mask is reduced to a single bit, which represents
the one CPU on which the process may execute.
The selection of hard affinity is a very static
operation. The selection of which CPU to run on is
determined prior to scheduling the process, and
the selection remains enforced until otherwise
requested.

Capabilities provide a logical mapping of
processes to services. These services may only be
available on certain CPUs in the SMP environment;
for example, primariness is a logical capability.
A capability may be serviced by one or
more CPUs in the SMP environment. Primariness,
for instance, is a capability that is
offered by at most one CPU in the SMP
environment.

When a process requires capabilities, the pro- 
cess indicates the desired capabilities in a 32-bit 
mask in the PCB. When the process is scheduled, 
a comparison is made of the current requested 
capabilities and the capabilities offered by the 
CPU being rescheduled. If the CPU has the 
required capabilities, then the process is exe- 
cuted; otherwise, the process is ignored and 
another process is chosen for execution. Any
active CPU offering a particular capability may
service any process requiring that capability.
Once the capability is no longer required by a 
process, the capability bit in the PCB is cleared 
and the process can execute on any CPU in the 
multiprocessing system. Thus, capabilities offer a 
much more dynamic load-leveling of processes 
across the CPUs in the system than does hard 
affinity. 
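
The scheduler-side check described above can be sketched with two bit masks. This is an illustrative model only: the capability bit assignment, the dictionary layout of a process, and the assumption that the run queue is already ordered from highest to lowest priority are all hypothetical.

```python
CAP_PRIMARY = 1 << 0   # hypothetical bit assignment for "primariness"


def eligible(cpu_id, cpu_caps, proc_affinity, proc_required_caps):
    """True if the process may run on this CPU."""
    in_affinity = bool(proc_affinity & (1 << cpu_id))
    has_caps = (proc_required_caps & ~cpu_caps) == 0  # every required bit offered
    return in_affinity and has_caps


def pick_process(cpu_id, cpu_caps, run_queue):
    """Return the first runnable process this CPU is allowed to execute."""
    for proc in run_queue:  # assumed ordered from highest to lowest priority
        if eligible(cpu_id, cpu_caps, proc["affinity"], proc["caps"]):
            return proc
    return None


if __name__ == "__main__":
    queue = [
        {"name": "diag", "affinity": 0b0001, "caps": 0},             # hard affinity
        {"name": "console", "affinity": 0b1111, "caps": CAP_PRIMARY},
        {"name": "user", "affinity": 0b1111, "caps": 0},
    ]
    print(pick_process(2, 0, queue)["name"])
```

Hard affinity appears here as an affinity mask with a single bit set, while a capability requirement leaves the affinity mask open and restricts the choice through the second mask.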

Device Affinity 

The VMS SMP design requires that the primary 
CPU have access to all I/O devices on the system.
Due to hardware asymmetry for certain devices in 
some existing multiprocessing systems, the VMS 
SMP design also had to include provisions for 
device affinity. For example, usually both devices 
in the console subsystem — the console terminal 
and the console block storage device — can only 
be accessed by the primary CPU. This is espe- 
cially evident on 8300 systems, where a physical 
backplane cable connection from one of the 
VAXBI slots (usually slot 2, which contains the 
primary CPU) limits access to the console sub- 
system to the primary CPU. 

Device affinity models the hardware asymmetry 
by allowing only a subset of the processors to 
access these I/O devices. Only the portions of
VMS software that access the hardware itself
(such as device driver routines that alter control
and status registers) must execute on one of the
CPUs in the device affinity set for that device. For
example, most of the initial processing of a $QIO
request can execute on any CPU. The driver code
actually starts the I/O transfer by controlling the
device by means of the control and status
registers. Only this portion of the driver code must
execute on a member of that device's affinity set.

Under the SMP design, all forking and postpro- 
cessing occur on the same CPU that received 
the device interrupt. The device affinity imple- 
mentation uses a "trickle down" method that 
requires no affinity checks for any of the queues.
Instead, fork threads are queued to the appropri- 
ate CPU in the first place. The SMP implementa- 
tion queues the fork threads by replicating the 
I/O postprocessing queue and the fork queues 
for each CPU in the per-CPU context area. Thus
each CPU can process its own fork and I/O post- 
processing queue without acquiring the various 
spinlocks that would be required for system-wide 
queues. Further, under this design, the set of 
CPUs to which a particular device is bound under 
device affinity is a proper subset of the CPUs that 
can service interrupts for that device. 

The affinity field for a device is stored as a bit
mask in the unit control block (in the field
UCB$L_AFFINITY). This bit mask represents
those CPUs that are allowed to access the
specified device. The default value for
UCB$L_AFFINITY is -1, allowing access from
any CPU to the device. As already mentioned,
the console subsystem devices are accessible
only from the primary CPU; therefore, the
UCB$L_AFFINITY mask for these devices is
initialized to the primary CPU only.

The affinity field for a device is checked on
entry to only two of the seven driver entry points: 

■ STARTIO 

■ ALT_STARTIO 

If the affinity check fails, the I/O request packet 
(IRP) is queued as a fork block to another CPU 
from which access is allowed. The fork block in 
the CDRP portion of the IRP is used to fork the 
request to another CPU. The fork block is queued 
to a work request queue in the selected CPU's 
per-CPU context area. An interprocessor inter- 
rupt is then delivered to notify the CPU that work 
is now present in its work request queue.
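
The affinity check and the hand-off to another CPU can be sketched as follows. This is a simplified model, not driver code: the function names and the dictionary of per-CPU work queues are assumptions, and picking the lowest-numbered allowed CPU as the target is an arbitrary choice made for the example.

```python
ALL_CPUS = -1  # default affinity value: every bit set


def lowest_allowed_cpu(affinity_mask):
    """Return the number of the lowest CPU whose bit is set in the mask."""
    mask = affinity_mask & 0xFFFFFFFF
    if mask == 0:
        raise ValueError("no CPU may access this device")
    return (mask & -mask).bit_length() - 1


def dispatch_startio(affinity_mask, current_cpu, work_queues, irp):
    """Run the request locally if allowed; otherwise queue it to another CPU."""
    if affinity_mask & (1 << current_cpu):
        return current_cpu                  # affinity check passed
    target = lowest_allowed_cpu(affinity_mask)
    work_queues[target].append(irp)         # per-CPU work request queue
    # A real system would now send an interprocessor interrupt to `target`.
    return target
```

With the default mask of -1 the check always passes, so the forwarding path is taken only for devices such as the console subsystem whose mask has been narrowed.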

All other entry points into device drivers are
serviced by the primary CPU, which must be
guaranteed access to all devices. These entry
points are normally called only during device
initialization and include the following entry
points:

■ TIMEOUT 

■ UNITINIT

■ CONTROLLER_INIT

■ CLONED_UCB

■ UNIT_DELIVERY

Process affinity is used to provide the device
affinity requirements for the $CANCEL system
service. When the $CANCEL request is serviced,
the UCB$L_AFFINITY field may not allow access
from the CPU on which the request was initiated.
If access is not allowed, then the process affinity
is changed to force the process to execute on a
CPU compatible with the affinity requirements of
the device.

Some VMS routines are always called when
I/O completes on the same processor that
serviced the device and fork level interrupt
dispatching. Therefore, device affinity is implicit for
these routines, and no affinity checks are made
prior to calling the routines REGISTER_DUMP
and MOUNT_VERIFICATION.

Future Investigations 

The initial VMS SMP design is finished, but many 
interesting areas invite further investigation. 
These include 

■ Performance improvements, perhaps finer 
granularity spinlocks 

■ Enhancements for parallel processing 

■ Provisions for higher availability

The key to the VMS SMP design is the new syn- 
chronization primitives, that is, spinlocks. The 
flexibility of the spinlock design will be impor- 
tant in future enhancements to SMP, as already 
proven in the evolution from static to dynamic 
spinlocks. 

Granularity is another important attribute of 
spinlocks, which are synchronization points. All
synchronization points must be factored into the
design of any multiprocessor system. Each
spinlock represents at most a single thread of
execution. Therefore, each section of code protected
by a spinlock can be executed by only one
processor at a time. If two processors attempt to
access the same section of code (termed a critical 
region), then only one processor will proceed 
while the other(s) spin-waits. To restate 
Amdahl's Law: You cannot get more than one 
CPU's worth of work out of any synchronization 
point. 
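
The effect can be made concrete with the usual textbook form of Amdahl's Law. The sketch below is not taken from the paper: it assumes a simple model in which a fixed fraction of the work is serialized behind one synchronization point, and the example fractions are hypothetical.

```python
def amdahl_speedup(n_cpus, serial_fraction):
    """Speedup over one CPU when `serial_fraction` of the work is serialized."""
    if n_cpus < 1 or not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("need n_cpus >= 1 and 0 <= serial_fraction <= 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cpus)


if __name__ == "__main__":
    for fraction in (0.0, 0.1, 0.5, 1.0):
        print(fraction, round(amdahl_speedup(4, fraction), 2))
```

When the serialized fraction is 1.0 (all work passes through one critical region), the speedup is 1 no matter how many CPUs are added, which is the restatement given above.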

The ability to increase the number of spinlocks 
should prove invaluable in future enhancements 
to SMP, as performance measurements indicate
which spinlocks need to change their granularity. 

Performance Evaluation of the
VAX 6200 Systems 

Performance evaluation is an essential element in the development of a 
computer system. An effort was made to accurately evaluate the perfor- 
mance of the VAX 6200 system under workloads that represent real 
customer environments. Workloads were developed to represent three 
major target markets — Engineering/Scientific, Commercial, and Gen- 
eral Timesharing. These workloads were used to drive the VAX 6200 sys- 
tems and thus to evaluate system performance in these environments. 
Performance measurement results indicate that the VAX 6200 system is a 
well-balanced multiprocessor system and that the multiprocessor perfor- 
mance is fairly linear across these workloads. 



Introduction 

The VAX 6240 system is a tightly coupled
multiprocessor system based on the CVAX
microprocessor. The system consists of four processors
sharing memory through a single, high-speed 
bus. This paper describes the process by which 
performance of the VAX 6240 system was evalu- 
ated under various workloads that represent 
target markets. The method used to develop and 
verify these workloads is discussed along with 
the evaluation of system performance. We use 
the multiprocessor efficiency measure, defined
as the relative throughput obtained by the addi- 
tion of each processor, to characterize multi- 
processor performance. Measurement of the 
VAX 6240 system indicates that the multiprocessor
efficiency measure is directly dependent on
the contention for shared resources generated by 
a workload. 
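
One possible reading of the efficiency measure is the throughput gained by each added processor, expressed relative to single-processor throughput. The sketch below implements that reading only as an assumption; the exact formula is not given in this passage, and the throughput figures in the example are invented for illustration.

```python
def incremental_gains(throughputs):
    """Throughput added by each processor, relative to one-CPU throughput.

    throughputs[i] is the measured throughput with i + 1 processors.
    """
    base = throughputs[0]
    if base <= 0:
        raise ValueError("single-processor throughput must be positive")
    gains = [1.0]
    for prev, cur in zip(throughputs, throughputs[1:]):
        gains.append((cur - prev) / base)
    return gains


if __name__ == "__main__":
    # Hypothetical jobs-per-hour figures for 1 to 4 processors.
    print(incremental_gains([100.0, 190.0, 270.0, 340.0]))
```

Under this reading, a workload with little contention for shared resources keeps each gain close to 1.0, while heavier contention makes the gains fall off as processors are added.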

Workload Development 

One of the major issues in evaluating the perfor- 
mance of a computer system has been in the 
workload area. In the context of this paper, 
workloads are software tools used to create inter- 
active multiuser environments in which the 
interactive throughput and responsiveness of the 
system are the key performance metrics.
Conversely, benchmarks are either single or multiple
copies of programs run in batch mode; the 
amount of time to complete execution of these 
programs is the performance metric. The question
continually debated is how well the benchmarks
and workloads represent current user
environments. Since there are many different kinds of
computing environments and both the applica- 
tions and computing styles are continually chang- 
ing, it is very difficult to develop representative 
workloads accurately. The approach taken here 
was to first survey the current customer popula- 
tion and identify a few major target markets. 
Table 1 consists of three surveys obtained from 
different sources, with n being the sample size. 



Table 1 Survey of Customer Environments 



Environment              Survey 1    Survey 2    Survey 3
                         n = 110     n = 200     n = 55K

Engineering/Scientific   46%         50%         31%
Commercial               40%         23%         35%
Education                8%          15%         8%
Software Development     6%          12%         4%
Miscellaneous                                    11%



Table 2 Distribution of Customer 
Environments 

Engineering/Scientific 40% 
Commercial 40% 
General Timesharing 20% 






Digital Technical Journal 

No. 7 August 1988 



CVAX-based Systems



[Figure 1 diagram labels: PROCESSOR 1 to PROCESSOR 3; STREAM 1 to
STREAM 4; streams running simultaneously on the processors;
throughput = number of jobs completed with multiple processors as
compared to one.]

Figure 1 Execution of Multiple Programs Run in Parallel



Clearly, Engineering/Scientific and Commer- 
cial environments dominate the market, with 
Education, Software Development, and General 
Timesharing applications accounting for the rest. 
Further examination of the Software Develop- 
ment and the Education environments showed 
much similarity in function, except that Software 
Development is slightly more compute intensive. 
Thus we further simplified the application
categories, as shown in Table 2.

We identified typical environments in each of 
these categories by evaluating system resource 
consumption in these environments rather than 
by evaluating what an end user does on the sys- 
tem. Thus we could simplify the number of 
parameters to CPU, memory, and I/O resource 
utilizations. Having identified these typical envi- 
ronments, we collected or developed bench- 
marks and workloads to represent them. 

Single Stream 

Acquiring single stream benchmarks was not as 
difficult as developing multiuser workloads. Most
of Digital's customers have benchmarks that 
represent their environments. Therefore, we 
acquired a collection of benchmarks to represent 
Engineering/Scientific, Commercial, and General 
Timesharing from various customer sites. These
benchmarks are used to evaluate the single- 
processor speed. 

Multistream Batch Jobs 
A stream of well-known benchmarks was selected 
that represented each of the above-mentioned 
Engineering/Scientific, Commercial, and General 
Timesharing markets. 

■ The engineering stream consists of typical
programs used in electrical circuit simulation,
oil reservoir simulation, flight simulation, and
linear equation solvers.

■ The scientific stream contains simulation
programs that use Monte Carlo techniques
to track particle movement, along with
commonly used routines from national
laboratories.

■ The commercial stream contains the activities
done by a personnel department to support
salary planning.

■ The general timesharing stream represents the
activities done in a software development or
education environment.

Multiple copies of this stream were run
simultaneously to take advantage of multiprocessor
compute resources (Figure 1). To capture the
maximum throughput, we ensured that all of the
processors were 100 percent busy while the
multiple streams were running on the system.

Multiuser Workload Development 
The overall process of workload development is 
shown in Figure 2. Our goal was to represent typ- 
ical timesharing environments for the different 
target markets. The entire strategy consisted of 

■ Identifying typical real sites 

■ Collecting data on resource utilization and 
image usage patterns 

■ Deriving a packaged workload to represent the 
real site environment 

■ Validating the workloads by comparing the 
resource utilization of the workload against 
the resource utilization at various customer 
sites and modifying the workloads as required 

Figure 2 Interactive Multiuser Workload Development



In the following sections, we describe how we
used this strategy to develop two multiuser
workloads: the engineering workload, which
represents an Electronic Computer-Aided Engineering
environment (ECAE); and the Software
Development Environment Workload (SDEW).

Data Collection 

Two Digital sites were chosen to represent the 
ECAE and SDEW environments. Internal sites 
were chosen initially to facilitate the data collec- 
tion process. Both sites had clustered environ- 
ments that consisted of a variety of VAX systems 
along with some workstations. 

We collected information on these clustered 
systems to capture their behavior under the load 
generated by the environment over a period of 
one week. VAX SPM software was used to collect 
resource utilization data (CPU, I/O, and memory 
utilization) on all the systems at both user level 
and system level. VMS Image Accounting was 
used to obtain resource utilization data on an 
image basis. Using the SET HOST/LOG Digital 
Command Language (DCL) command, we col- 
lected log files of user sessions to study user 
habits. Other user characteristics, such as think 
time and type rates, were obtained through inter- 
views and observations. 

Data Analysis 

The performance team studied the cluster-wide 
resource utilization profiles in order to select the 
time when the interactive activities were pre- 
dominant. We compared resource utilization 
profiles of individual systems against the cluster-wide
average over a week's accumulation of data.
Based on this comparison, we selected a typical 
day and a typical system. One hour was chosen 
from the typical system on a typical day during 
the period of peak interactive use to characterize 
the system at full load. 

Further, based on the user profiles, we 
classified users according to computer usage, 
that is, heavy or light computing (for ECAE 
workload) and heavy, medium, or light comput- 
ing (for SDEW workload). We then used the 
image accounting data and user log files to
classify users according to the type of activity they
performed. 

Once several user classes were identified, the 
number of users in each class, or user mix, was 
determined. We defined the user mix by looking 
at (1) the number of users in each class at the
one-hour peak, and (2) the organization
structure at the real sites. Table 3 shows the user mix
for ECAE and SDEW workloads. In addition to
interactive users, these workloads also have batch
jobs running in the background.

Table 3 ECAE and SDEW User Mix

ECAE User Mix

Type of User                  No. of Users
Engineer: Heavy               3
Engineer: Light               3

SDEW User Mix

Type of User                  No. of Users
Heavy software development    1
Light software development    3
Secretary                     1
Technical writer              1

Developing the Workload 

Having identified the user classes and activities, 
we then developed an intermediate workload 
using DCL command procedures. This inter- 
mediate step allowed easier translation to the 
final workload, which was based on VAXRTE 
(VAX/VMS Remote Terminal Emulator) scripts. 
Individual user scripts were developed and
validated. We then packaged the entire workload by
integrating all of the user scripts and the batch 
jobs. Once development was complete, the 
workload was validated at both system and user 
levels against the real internal site. Further vali- 
dation was done at the user level against Digital's 
customer sites. 

Workload Validation 

This section describes the workload validation 
process using the ECAE workload as an example 
of the validation methodology. 

Validation against "real" internal site — The 
workload was tested using the same hardware 
configuration as the real system. For the ECAE 
workload, a VAX-11/780 system with 32 megabytes
(MB) of memory, RA81 disks, and six interactive
users was tested. The purpose of this test
was to compare the resource utilization of the 
workload in an hour-long experiment to the 
resource utilization of the real system during the 
typical hour. System- and process-level resource 
utilization data of several different resources 
were compared. 

User-level validation — To validate the work- 
load at the user level, we compared the average 
CPU and direct I/O (DIO) utilizations computed 
for 1 hour for the different user classes. The 
results are shown in Table 4. 

CPU utilization for all three user classes
validated to within approximately 10 percent,
which was considered to be well within
acceptable limits. Validation of the DIO rate was made
somewhat difficult because (1) the DIO rate on a
per-user basis was very low (0.3 DIO per second
for the heavy user), and (2) measurement of the
DIO rate is only accurate to 0.1 DIO per second.
For all three user classes, the workload came to
within 0.1 DIO per second of the values
measured from the real site.

Table 4 User Resource Utilization for Real
Internal System and ECAE Workload

              CPU minutes/hour      DIO/second
User Class    Real      ECAE        Real    ECAE
Heavy         1.6       1.5         0.3     0.4
Light         0.5       0.5         0.2     0.1
Batch         42.8      48.5        0.0     0.1
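
The user-level comparison can be reproduced from the Table 4 values with a few lines of Python. This is a sketch for checking the arithmetic only: the function names are invented, and the choice of the real-site value as the denominator of the relative difference is an assumption.

```python
# (real site, ECAE workload) pairs taken from Table 4
TABLE_4_CPU = {"Heavy": (1.6, 1.5), "Light": (0.5, 0.5), "Batch": (42.8, 48.5)}
TABLE_4_DIO = {"Heavy": (0.3, 0.4), "Light": (0.2, 0.1), "Batch": (0.0, 0.1)}


def relative_difference(real, workload):
    """Difference between workload and real site, as a fraction of real."""
    return abs(workload - real) / real


def cpu_report():
    """Relative CPU difference per user class."""
    return {cls: relative_difference(real, wl)
            for cls, (real, wl) in TABLE_4_CPU.items()}


def dio_within(tolerance=0.1, eps=1e-9):
    """True if every class's DIO rate is within `tolerance` DIO per second."""
    return all(abs(wl - real) <= tolerance + eps
               for real, wl in TABLE_4_DIO.values())


if __name__ == "__main__":
    for cls, diff in cpu_report().items():
        print(f"{cls}: {diff:.1%}")
    print("DIO within 0.1/s:", dio_within())
```

With this denominator, the heavy and light classes differ by about 6 percent and 0 percent, and the batch class by about 13 percent. The DIO rates are compared by absolute difference because a relative difference is undefined for the batch class, whose real-site rate is 0.0.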

System-level validation — For system-level val- 
idation, we compared the system-level usage of 
CPU, disk I/O, and memory for the 1-hour ECAE 
test experiment to the peak hour of the real 
system. Figure 3 shows that the CPU was used 
100 percent of the time on the real system during
the 1 hour; whereas the CPU utilization in the 
workload tended to vary slightly more, but was 
always between 90 percent and 100 percent sat- 
urated. The average CPU utilizations of the real 
system and the ECAE workload are very close at 
100 percent and 93 percent, respectively. 

The DIO utilization over a 1-hour period for 
the two systems is compared in Figure 4. For both
systems there is significant variability in the DIO
rate over the 1-hour period. The ECAE workload
was slightly more bursty, but the average DIO 
rates for the real system and the ECAE workload 
were very close at 3.3 and 3.0 DIO operations per
second, respectively. 

Memory utilization on the two systems did not 
vary substantially over the 1-hour period. How- 
ever, total average memory usage with the 
workload, 23MB, was less than on the real sys- 
tem, 29MB, as depicted in Figure 5. 

Although the workload validated very well for 
CPU and DIO resource utilization, the workload 
used 20 percent less memory than was used at 
the real site. This was in part due to the fact that 
during the development of the workload the CPU 
and disk I/O utilization of subprocesses was 
added to the resource utilization of the parent 
process. Although the workload represents the 
work done by those subprocesses and the load 
placed on CPU and disk I/O resources, the 
workload does not represent the additional mem- 
ory required by those subprocesses. As will be 
described in subsequent sections, the lower 
memory utilization of the workload did not con- 
stitute a problem. 

Figure 3 CPU Utilization for Real Internal System and ECAE Workload for 1 Hour

Figure 4 DIO Utilization for Real Internal System and ECAE Workload for 1 Hour

Figure 5 Memory Utilization for Real Internal System and ECAE Workload for 1 Hour

A summary of the comparisons of the average
resource utilizations for the real system and the 
workload is presented in Table 5. 
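
The size of each gap between the workload and the real system can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the averages reported in Table 5; the function and variable names are illustrative and are not part of the original study.

```python
# Average resource utilizations as reported in Table 5.
TABLE_5 = {
    "CPU busy (%)": {"ecae": 93.0, "real": 100.0},
    "DIO/second": {"ecae": 3.0, "real": 3.3},
    "Memory (MB)": {"ecae": 23.0, "real": 29.0},
}


def relative_difference(workload, real):
    """Signed difference of the workload from the real system,
    expressed as a fraction of the real-system value."""
    return (workload - real) / real


def validation_report(table):
    """Percent difference per resource, rounded to one decimal place."""
    return {
        name: round(100 * relative_difference(v["ecae"], v["real"]), 1)
        for name, v in table.items()
    }


report = validation_report(TABLE_5)
for name, pct in report.items():
    print(f"{name:14s} {pct:+.1f}% relative to the real system")
```

The memory figure of about 21 percent below the real system corresponds to the "20 percent less memory" noted above.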

Validation against customer sites — This vali- 
dation of the workload against the internal system 
was followed by validation against customer 
systems. The goal of this additional validation 
was to determine if the workload was representa- 
tive of the load placed on systems by Digital's 
customers. 

Two semiconductor manufacturers in Califor- 
nia were used as validation sites for the ECAE 
workload. Initially, it was determined that there 
were significant differences between the work 
performed at these customer sites and the work 
performed at the internal Digital site. The Digital 
internal VAX systems were used for logic design 
of gate arrays, circuit boards, and systems; 
whereas at the external sites, the VAX systems 
were used for the design of integrated circuits. 
Specifically, the work differed in the following 
ways: 

■ DECSIM is used extensively within Digital, 
whereas SPICE is the predominant simulation 
software used by external semiconductor 
developers. DECSIM simulations require very 
large amounts of memory as compared to the 
SPICE simulations done by customers. 

■ Design rule checking is both a time-critical
and disk I/O-intensive task done by semiconductor
designers. Design rule checking and
the load it places on the I/O subsystem were
not executed at the internal Digital site at the
time resource utilization data was collected.

As a result, we modified the ECAE workload to 
include the load placed on the system by design 
rule checking and replaced the use of DECSIM 
with SPICE. 

System resource utilization data was collected
on VAX 8800 systems for one week at these customer
sites. In a manner very similar to the process
used for the initial development of the
workload, the data from these sites was reduced
to a typical peak period. Table 6 presents the
comparison of resource utilization on a per-user
basis in the workload and at customer sites.

The ECAE workload falls within the range of
utilizations observed at these customer sites for
both disk and memory utilizations. The workload
is slightly (approximately 10 percent) more CPU
intensive on a per-user basis than was observed at
customer sites. This workload will put a 10 percent
heavier load on the system, making the performance
numbers slightly conservative for the
computer-aided electrical engineering market.

Table 5 System-Level Resource Utilization
for Real Internal System and
ECAE Workload

Resource       ECAE     Real System
CPU busy       93%      100%
DIO/second     3.0      3.3
Memory         23MB     29MB

Table 6 Comparison of Resource Utilization
on Customer System and
in ECAE Workload

Resource Utilization       Customer     ECAE
per Hour                   Sites        Workload
CPU (minutes/hour)         3.8-4.5      5.0
DIO operations/second      1.4-2.3      1.8
Memory (MB)                0.7-0.8      0.8

Performance Measurement and 
Analysis 

This section discusses the performance of the systems
in three major applications: Engineering/Scientific,
Commercial, and General Timesharing.
In each of the environments, single-stream,
multistream, batch, and multiuser workloads
were tested.

Single-Stream Performance 
The first step in evaluating the performance of a 
multiprocessor system is to establish the base- 
level performance of the uniprocessor relative to 
a well-known system such as the VAX-11/780. A
large number of single-user benchmarks were
used to establish this base level.

Single-User Performance

Single-user performance was evaluated by using 
traditional synthetic benchmarks, well-known 
industry standards, and real application programs 
from engineering, scientific, commercial, and 
general timesharing environments. Most of the 
synthetic benchmarks are in FORTRAN; industry 
standards are Whetstones, Dhrystones, Linpack, 
and others. The real applications, as mentioned,
represent four environments.

Figure 6 Frequency Distribution of the VAX 6210 Performance
on the Single User Benchmark Set

These benchmarks were used to evaluate
uniprocessor speed compared to a VAX-11/780
system. A frequency distribution of the speedup
factors on all these benchmarks was plotted, and
the central tendency was examined. (See Figure 6.)
A high percentage of the benchmarks fell
between 2.2 and 2.8.

Table 7 summarizes the performance of the
VAX 6210 in the single-user environment relative
to a VAX-11/780 system. The performance average
of the VAX 6210 system, across all these
benchmarks, is 2.8 times the performance of a
VAX-11/780 system.

Table 7 Performance of the VAX 6210 in the
Single-User Environment

Synthetic Benchmark Set:
  Single-user set                  2.5
Industry-standard Benchmarks:
  Whet-s & -d                      2.3
  Linpack-s                        2.7
  Linpack-d                        3.2
  Dhrystone                        2.8
Real Application Benchmark Set:
  Engineering set                  2.8
  Scientific set                   2.6

Decomposed Single-User Performance
VAX 6200 performance on decomposed programs
was evaluated through the use of manual
and directed decomposition techniques. To
begin with, a program is evaluated to see if some
segments can be separated into parallel threads
that can be run independently. Then the program
is decomposed and run, either manually or
through directives. The program is initiated as a
single job; then the segments of the program that
lend themselves to decomposition are divided
into subprocesses and executed in parallel on
different processors. In the manual decomposition
method, the optimal number of subprocesses
for various levels of multiprocessor systems
is evaluated by varying the number of
subprocesses and calculating the speedup factors.
In the directive decomposition method, the
compiler takes care of various optimization factors.
These programs were run standalone with
no interference from any other programs on the
system. Figure 7 illustrates the decomposition
process.
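
The manual decomposition described above can be sketched as follows. This is a hypothetical illustration, not the code used in the study: a matrix multiplication is divided into row blocks, each block is handed to a separate worker, and the number of workers is varied. Python threads stand in for the VMS subprocesses, so the sketch shows the structure of the decomposition rather than a real speedup. All function names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply_rows(a_rows, b):
    """Multiply a block of rows of A by the full matrix B."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a_rows]


def split_rows(a, parts):
    """Split the rows of A into `parts` nearly equal contiguous blocks."""
    size, extra = divmod(len(a), parts)
    blocks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        blocks.append(a[start:end])
        start = end
    return blocks


def decomposed_matmul(a, b, workers):
    """Run one worker per row block and reassemble the result in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = pool.map(multiply_rows, split_rows(a, workers), [b] * workers)
        return [row for block in partial for row in block]


a = [[i + j for j in range(6)] for i in range(6)]
b = [[i * j + 1 for j in range(6)] for i in range(6)]
serial = multiply_rows(a, b)

# Vary the number of workers, as in the manual method; every
# decomposition must reproduce the single-job result.
results = {n: decomposed_matmul(a, b, n) for n in (1, 2, 3, 4)}
print(all(results[n] == serial for n in results))
```

In a real measurement the elapsed time of each run would be recorded and divided into the single-worker time to obtain the speedup factor for that number of subprocesses.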

The benchmark description is as follows. To 
evaluate the maximum speedup factors that can 
be achieved through decomposition, code seg- 
ments were selected. Such segments as matrix 
multiplication and convolution are widely used 
in engineering/scientific applications. Different 
array sizes (from 100 to 1000) were used with 
various arithmetic data types such as integer, and 
single and double precision. 

An image processing program and the Linpack 1000D
program were used to represent real
application programs, where only certain segments
can be decomposed.

Figure 7 Program Decomposition Process
(diagram: one program decomposed into parallel code
running on separate processors; speed = time taken
to complete the job)

The performance results are as follows. The
multiprocessor efficiency measure, defined as the
relative speedup obtained by the addition of each
processor, is the key metric used here to evaluate
performance. As seen in Figure 8, the multiprocessor
efficiency measure on the program kernels
is fairly linear. Multiprocessor synchronization is
minimal in this computing environment. The
performance was very close to the theoretical
maximum. A speedup of 3.9 times the uniprocessor
performance was achieved on the four-processor
6240 system. The performance on the
image processing program is slightly lower than
what was observed on the program kernels. Thus
performance gained by decomposition depends
directly on the amount of code that can be run in
parallel. (Note: On the Linpack 1000D program,
directed decomposition was used; whereas on
the other programs, manual decomposition was
used.)
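
The dependence of speedup on the amount of parallel code can be made concrete with Amdahl's law. The law is not named in the text; it is used here as a standard model, and the sketch below simply asks what parallel fraction would be consistent with the reported speedup of 3.9 on four processors. The function names are illustrative.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speedup predicted by Amdahl's law for a given parallel fraction."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / processors)


def implied_parallel_fraction(speedup, processors):
    """Invert Amdahl's law: the parallel fraction that yields `speedup`."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / processors)


observed_speedup = 3.9   # reported for the program kernels on the 6240
processors = 4

p = implied_parallel_fraction(observed_speedup, processors)
efficiency = observed_speedup / processors

print(f"implied parallel fraction: {p:.3f}")
print(f"fraction of linear speedup: {efficiency:.3f}")
```

Under this model a speedup of 3.9 implies that roughly 99 percent of the kernel's run time was spent in code that ran in parallel. A program with a larger serial portion, such as the image processing program, would be expected to show a lower speedup.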

Multistream Batch Performance 
Measurement and Analysis 
The multistream jobs were used to measure the
system-level batch performance on the multiprocessor
systems. As shown in Figure 1, these multiple
streams were run in parallel to allow concurrency
in the execution of these streams.
Maximum concurrency is achieved since each of
these streams is identical. No single stream runs
any faster; however, the number of jobs completed
increases almost linearly with the addition
of processors. Adequate memory was allocated to
the jobs to avoid unnecessary paging and swapping.
In addition, sufficient I/O resources were
present on the system to preclude I/O bottlenecks.
The elapsed time to complete these jobs
was recorded and used to evaluate the multiprocessor
batch throughput performance. It is
important that all the streams run simultaneously
and share resources equally. Large differences in
the completion times of streams would imply that
maximum concurrency was not achieved because
of some bottleneck in the system.

Multiprocessor performance on multistream
batch jobs was very close to linear across all
environments. Results for the commercial stream,
representing personnel administration, were only
slightly lower, probably because of the higher
amount of I/O on this stream. (See Figure 9.)

Figure 8 Multiprocessor Efficiency through
Parallel Processing

Interactive Multiuser Performance 
Measurement and Analysis 
In the interactive multiuser environment, the 
system must support the activities of a substan- 
tially higher number of users and their frequent 
interaction with the system. The number of users 
on the system increases the amount of context 
switching, and the contention for shared 
resources is also much higher in this environ- 
ment. 

The test methodology included the use of a 
remote terminal emulator, VAXRTE, to create the 
interactive multiuser environment. The VAXRTE 
generated the input for the system under test and 
received consequent output. The VAXRTE also 
logged and time-stamped all interactions and 
maintained the job mix throughout the experi- 
ment. To run a multiuser experiment, the system 
under test and the VAXRTE system were booted 
and running. Using scripts, every few seconds the 
VAXRTE logged a user on to the system under 
test. After all logins were completed, sufficient 
time was allowed for the system to reach a steady 
state. The experiment was then run long enough 
to execute the longest script cycle for the 
specific workload. While the experiment was 
running, VMS monitor and other monitoring tools 
were used to capture the resource utilization
data. When the experiment was completed, data
was reduced and analyzed.

Workload description — Three interactive 
multiuser workloads were used to evaluate the 
multiprocessor performance in the three major 
environments: Engineering, Commercial, and 
General Timesharing. 

The Engineering environment was represented 
by an ECAE workload. This workload consists of 
the types of tasks done by design engineers devel- 
oping electronic circuits: circuit simulation, 
design rule checking, schematic file transfers 
from workstations, and tasks supported by VMS 
utilities. 

The multiuser Commercial (Compu-Share)
workload is based on the Compu-Share Order 
Processing software package. This workload con- 
sists of three major types of transactions: order 
entry, order inquiry, and accounts receivable 
reporting. 

The General Timesharing SDEW represents the
types of tasks done by software engineers. The
major tasks executed in this workload are
compile-link-execute-debug cycles using FORTRAN,
BLISS, and MACRO; utilities used include CMS,
RUNOFF, and text editors.

Hardware/software setup — Table 8 summa- 
rizes the hardware and software configurations. 

Figure 9 Multistream Efficiency Measures

Table 8 Summary of Hardware and Software
Configurations

Hardware Configuration

Processor          VAX 6240
Memory             128MB
Disk controller    2 HSC70
Disks              (Disk configurations differed for
                   each workload; see below.)

Number of RA82 Disks per Workload

Dedicated Use    ECAE    Compu-Share    SDEW
System           1       1              1
Page/swap        1       1              1
Library          -       -              1
Interactive      2       2              4
Batch            3       -              2
Database         -       6              -

Note: Where necessary, software was distributed
over multiple disks to avoid disk bottlenecks.

Software Configuration

VMS V5.0 - FT2.1 (A single-processor system was
run with multiprocessing turned off.)

Performance metric — The multiprocessor efficiency
measure is defined to be the relative multiprocessor
interactive throughput compared to
the uniprocessor throughput.

Also considered in this metric is system responsiveness,
based on acceptable service criteria for
light, medium, and heavy tasks. This metric is
used to evaluate the number of users supported
at peak throughput while the system maintains
the service time criteria. System resources
required to support each application are also
identified.

Performance results — The multiprocessor 
efficiency measure is very close to linear in both 
the ECAE and SDEW environments (Figure 10).
This result shows that even in the multiuser inter- 
active environments near linear performance can 
be expected if the system is well balanced in 
terms of processor speed and the memory-to-pro- 
cessor bus speed. It also indicates the efficiency 
of the VMS SMP software. In the Compu-Share 
environment, the performance was slightly lower 
because of the high amount of disk and terminal 
I/O generated by this workload. The perfor- 
mance of the multiprocessor systems under sym- 
metric multiprocessing (SMP) depends directly 
on the amount of I/O. It is important to note that 
even with high amounts of I/O, the multiproces- 
sor efficiency measure is well over three for the 
four-processor system. 




Figure 10 Multiprocessor Efficiency Measure
for All Multiuser Workloads
(horizontal axis: VAX 6210, VAX 6220, VAX 6230,
VAX 6240; key: ECAE, Compu-Share, SDEW)



At the peak throughput levels, response time 
criteria were maintained in each workload. 
Table 9 compares users supported and resources 
used by each of these workloads. The maximum 
number of users supported on the VAX 6240 is
38, 120, and 126 users for ECAE, Compu-Share,
and SDEW, respectively.

Table 9 Summary of Workload Resource Utilizations

Multiuser                              ECAE          Compu-Share    SDEW
Number of users supported
at the peak                            10,20,28,38   30,60,-,120    36,66,90,126

Resource utilization
Number of users                        38            120            126

CPU - 6240
  Percent utilized                     100%          100%           100%
  Interrupt                            2%            6%             4%
  Kernel                               12%           29%            20%
  Executive                            3%            7%             7%
  MP synch                             1%            7%             2%
  User                                 82%           51%            67%

I/O
  Disk I/O profile                     Bursty        Uniform        Bursty
  Average disk I/O per second          24            113            68
  Average buffered I/O per second      82            112            76

Memory
  Maximum used (MB)                    32            60             57

In terms of resource utilization, it should be
noted that the multiprocessor synchronization
under SMP is handled by spinlocks. A spinlock is
a bit in shared memory that is accessible by
means of interlocked instructions by all processes
through mutual agreement. Mutual agreement
implies that a process can set the bit and
gain access to the scheduler database if no other
process has access to it. If a process tries to set
the bit and the bit is already set, then the process
continues to "spin," using a sequence of instructions
to keep checking whether the bit is
clear. MP synch is the amount of CPU time spent
waiting to change the bit or acquire the spinlock
and thus gain access to the scheduler database.
MP synch is 1 percent for ECAE, 7 percent for the
Compu-Share workload, and 2 percent for the
SDEW workload. Since MP synch is the CPU time
spent waiting to acquire spinlocks and indicates
the amount of spinlock collisions, it shows the
level of contention for shared resources experienced
by SMP under each workload. For the
Compu-Share workload, this level is significantly
higher. The Compu-Share workload generates the
most disk I/O compared to the other workloads,
which may be the reason for a higher amount of
time spent by this workload in MP synch.
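
The spinlock behavior described above can be sketched in a few lines. This is an illustration only, not the VMS implementation: a non-blocking `threading.Lock.acquire` stands in for the interlocked instruction that tests and sets the bit, and a Python dictionary stands in for the shared scheduler database. The class name and the `spins` counter are assumptions made for the example.

```python
import threading


class SpinLock:
    """Busy-wait lock. The non-blocking acquire plays the role of the
    interlocked test-and-set: it atomically sets the "bit" and reports
    whether this caller was the one that set it."""

    def __init__(self):
        self._bit = threading.Lock()
        # Failed attempts, loosely analogous to time in MP synch.
        # Updated without synchronization, so it is only approximate.
        self.spins = 0

    def acquire(self):
        while not self._bit.acquire(blocking=False):
            self.spins += 1  # bit already set: keep checking

    def release(self):
        self._bit.release()


def worker(lock, shared, iterations):
    for _ in range(iterations):
        lock.acquire()
        try:
            shared["count"] += 1  # critical section on the shared data
        finally:
            lock.release()


lock = SpinLock()
shared = {"count": 0}
threads = [threading.Thread(target=worker, args=(lock, shared, 1000))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared["count"], "updates;", lock.spins, "failed attempts")
```

The number of failed attempts grows with the number of contenders and with the time each holder spends inside the critical section, which is the effect the MP synch figures capture for the three workloads.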

The following three graphs, Figures 11, 12, and
13, present the CPU mode usage profiles. The
Compu-Share workload shows higher but more
uniform interrupt and kernel mode activities.
Compu-Share's use of databases, which generates
heavy I/O and local locking, is manifested in the
heavy kernel and interrupt mode activity. SDEW
does a fair amount of file manipulation. ECAE has
much lower I/O activity than both Compu-Share
and SDEW.

Figure 11 CPU Utilization over Time —
ECAE Workload

Figure 12 CPU Utilization over Time —
Compu-Share Workload

Figure 13 CPU Utilization over Time —
SDEW Workload

The next three graphs, in Figures 14, 15, and
16, compare the I/O profiles. The disk I/O on
ECAE and SDEW is very bursty, and it is interesting
to note that their relative CPU mode profiles
correlate well, showing a relationship between
the two. The I/O on Compu-Share is high but not
as bursty.

Figure 14 Disk I/O Utilization over Time —

Figure 15 Disk I/O Utilization over Time —

Figure 16 Disk I/O Utilization over Time —
SDEW Workload

Comparing the disk I/O generated by the
workloads and the effect it has on CPU utilization,
Compu-Share puts the heaviest load on the
multiprocessor system. However, even with all
the synchronization necessary on this workload,
the multiprocessor efficiency measure is fairly
high (3.3). The ECAE and SDEW workloads show
high multiprocessor efficiency measures of 3.8
and 3.9, respectively. This level of gain in the
multiuser environment on the multiprocessor
systems shows that VMS SMP is working efficiently
and that the VAX 6240 system is a well-balanced
system in terms of the processor and
bus speeds.
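
To put the three efficiency measures on a common scale, each can be expressed as a fraction of the ideal value of four for a four-processor system. The sketch below uses only the figures reported in the text and in Table 9; the names are illustrative.

```python
PROCESSORS = 4

# Multiprocessor efficiency measures reported for the VAX 6240.
EFFICIENCY_MEASURE = {"ECAE": 3.8, "Compu-Share": 3.3, "SDEW": 3.9}

# MP synch share of CPU time from Table 9, in percent.
MP_SYNCH_PERCENT = {"ECAE": 1, "Compu-Share": 7, "SDEW": 2}


def percent_of_linear(measure, processors):
    """Efficiency measure as a percentage of perfect linear scaling."""
    return round(100 * measure / processors, 1)


summary = {name: percent_of_linear(m, PROCESSORS)
           for name, m in EFFICIENCY_MEASURE.items()}

lowest_efficiency = min(EFFICIENCY_MEASURE, key=EFFICIENCY_MEASURE.get)
highest_synch = max(MP_SYNCH_PERCENT, key=MP_SYNCH_PERCENT.get)

for name, pct in summary.items():
    print(f"{name:12s} {pct:5.1f}% of linear, "
          f"MP synch {MP_SYNCH_PERCENT[name]}%")
```

The workload with the lowest efficiency measure is also the one with the highest MP synch time, which is consistent with the explanation given for Compu-Share.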

Application Characteristics Affecting 
Multiprocessor Performance 

This section discusses some of the characteristics 
in applications that directly affect multiprocessor 
performance. 

Memory-to-Processor Traffic
Since these multiprocessor systems share memory,
contention to access memory could be a
factor that affects multiprocessor efficiency.
Therefore applications that generate lower
memory-to-processor traffic do perform better,
assuming there are no other bottlenecks in the system.
One way to reduce this traffic is to organize the
data to improve locality of reference. Data that is
accessed together should be placed together.
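
The effect of data placement on memory traffic can be illustrated with a deliberately simple model. The sketch below is an assumption-laden example, not a measurement: it counts how many block fetches occur when a row-major array is walked by rows and by columns, for a memory that keeps only the most recently fetched block. The array size, 4-byte element size, and 8-byte block size are hypothetical parameters chosen for the example.

```python
def block_fetches(addresses, block_size):
    """Count fetches for a memory that holds only the last block used."""
    fetches, current = 0, None
    for addr in addresses:
        block = addr // block_size
        if block != current:
            fetches += 1
            current = block
    return fetches


def row_major_addresses(rows, cols, elem_size, by_rows):
    """Byte addresses touched when a row-major array is walked
    row by row (by_rows=True) or column by column (by_rows=False)."""
    if by_rows:
        order = ((r, c) for r in range(rows) for c in range(cols))
    else:
        order = ((r, c) for c in range(cols) for r in range(rows))
    return [(r * cols + c) * elem_size for r, c in order]


ROWS, COLS, ELEM, BLOCK = 64, 64, 4, 8  # hypothetical sizes

by_rows = block_fetches(row_major_addresses(ROWS, COLS, ELEM, True), BLOCK)
by_cols = block_fetches(row_major_addresses(ROWS, COLS, ELEM, False), BLOCK)

print("fetches walking by rows:   ", by_rows)
print("fetches walking by columns:", by_cols)
```

In this model, walking the array in the order in which it is stored halves the number of fetches, because each fetched block supplies two adjacent elements.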

Disk I/O Operations 

With the symmetric multiprocessing software,
I/O operations can be handled by each of the
processors. As a result, the I/O-intensive applications
perform much better on the symmetric
multiprocessor systems as compared to the asymmetric
multiprocessing systems. However, the
I/O device interrupts are still handled by the primary
processor, even under SMP. By reducing the
rate at which device interrupts are made, any
contention for the primary processor can be
reduced. To reduce the number of I/O interrupts,
larger block transfers may be better in
I/O-intensive applications. Thus, an application that
will lend itself to making larger block transfers

With minimum bus cycle time down by more
than a factor of two and dynamic random-access 
memory (RAM) access time remaining relatively 
constant, the opportunity arose to increase per- 
formance by using static RAM to add a cache. 
Static RAMs with 35-ns access times and 64-kilo- 
bit (Kb) densities could be used for this purpose 
at reasonable cost. 

Design Partitioning and 
Functionality 

To facilitate implementation of the processor
module using custom VLSI, the design was partitioned
into seven major parts: the central processing
unit with first-level cache, the floating point
unit, the second-level cache, the memory controller,
the Q22-bus interface, the system support
functions, and the clock circuitry. Each of
these partitions was implemented by a single
chip, with the exception of the second-level
cache. This cache was implemented by programmable
array logic (PAL) and static RAMs.
Five of the parts are connected directly to a
32-bit multiplexed address/data (CDAL) bus:
the central processing unit with first-level cache
(CVAX), floating point unit (CFPA), second-level
cache, memory controller (CMCTL), and
Q22-bus interface (CQBIC). To reduce loading,
the chip containing the system support functions
(SSC) connects to a buffered version of this bus,
the BCDAL. The clock circuitry (CCLK) was separated
from the processor chip to conserve pins as
well as to allow designers more flexibility in
choosing a clock rate.

To maximize performance, the CVAX, CFPA, 
second-level cache, and CMCTL operate synchro- 
nously from a four-phase clock generated by the 
CCLK. The SSC and CQBIC operate asynchronously
on a 40-MHz oscillator. The processor module 
was designed to allow the CCLK to be fed either 
from the 40-MHz oscillator or from a separate 
oscillator. The separate oscillator allowed the 
central processor and memory subsystems to be 
sped up when it was determined that the CVAX, 
CFPA, and CMCTL chips were capable of running 
ten percent faster than originally projected. 

Each of the major parts of the processor
module is described in the following sections.

The Central Processing Unit and 
First-level Cache 

The CVAX chip is a microcoded 32-bit VAX CPU.
To implement the entire VAX architecture using a
single chip, the CVAX designers selected a subset
of the full VAX instruction set and data types.
The implementation includes 175 instructions
and six data types (also implemented by the
MicroVAX II system), plus six additional string
instructions: CMPC3, CMPC5, LOCC, SCANC,
SPANC, and SKPC. The CVAX also provides microcode
support for emulation of 53 additional
instructions (six fewer than the MicroVAX II) and
five data types. When any of these instructions is
decoded, an emulated instruction exception is
generated. This exception causes a set of
instruction-specific parameters to be pushed on the
stack and control to be passed to operating system
emulation routines by the emulated instruction
vector in the system control block. As in the
MicroVAX II, the remaining 70 instructions and
three data types are handled by the CFPA chip.
The CVAX implements the following registers:

■ Sixteen 32-bit general-purpose registers

■ Twelve VAX standard internal processor registers
to support memory management, process
control, interrupts, and system identification
(SBR, SLR, MAPEN, TBIA, TBIS, TBCHK, PCBB,
SCBB, IPL, SIRR, SISR, and SID) 1

■ Five internal processor registers specific to the
CVAX to support the interval clock, first-level
cache, error reporting, and console emulation
(ICCS, CADR, MSER, SAVPC, and SAVPSL) 2

The CVAX also provides a means for accessing six
additional VAX standard internal processor registers
to support the time-of-year clock, console
serial line, and I/O bus (TODR, RXCS, RXDB,
TXCS, TXDB, and IORESET). 1 These registers are
implemented in the SSC.

The registers in the SSC are referred to as 
"external" internal processor registers and are 
accessed by software in the same manner as other 
internal processor registers, that is, by means of 
MTPR and MFPR instructions. However, the CVAX 
chip generates a special cycle on the CDAL bus 
with the register number as an address. The SSC 
responds to these cycles by either supplying the 
CVAX with the register contents (MFPR) or per- 
forming the register update (MTPR). Accesses to 
other unimplemented VAX internal processor 
registers will also cause these cycles to be gener- 
ated, but the cycles will terminate with an error 
condition. (The cycles are timed out after four 
microseconds by a CDAL bus timer in the SSC.) 
When a register write is made to an unimplemented
internal processor register, the CVAX
ignores the error signal; the result is a long
no-operation. When a register read of an unimplemented
internal processor register is attempted,
the results are undefined.

Also like the MicroVAX II system, the CVAX
processor implements a memory management
unit. The unit supports full VAX demand-paged
virtual memory, with single-level page tables for
system space addresses and double-level page
tables for process space addresses. In addition,
four levels of access protection are supported by
the memory management unit. A 28-entry, fully
associative address translation buffer is provided
for storing recent virtual-to-physical address
translations (as opposed to an 8-entry translation
buffer in the MicroVAX II).

Unlike the MicroVAX II system, the CVAX in- 
cludes an on-chip (first-level), physical instruc- 
tion and data cache. Because chip area was at a 
premium, a 1 KB, two-way set associative organi- 
zation was chosen. In contrast to the second-level 
cache, this organization achieves a high hit rate 
for the available chip area through increased control
logic complexity instead of increased storage
array size. The extra control logic complexity of
the first-level cache is more efficiently imple- 
mented in custom VLSI, whereas the large storage 
arrays of the second-level cache are more effi- 
ciently implemented with off-the-shelf parts. 
Since the first-level cache organization yields a 
set size equal to the memory page size, cache 
look-up and virtual-to-physical address transla- 
tion can be overlapped. Thus a cache cycle time 
equal to the processor microcycle time is 
achieved. 
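
The overlap of cache look-up and address translation follows from the cache geometry. The sketch below works through the arithmetic using the sizes given in the text and the 512-byte VAX page size. It assumes that the "set size" in the text refers to the storage in one of the two associative halves of the cache, which the sketch calls a way; the function names are illustrative.

```python
CACHE_BYTES = 1024   # 1 KB first-level cache (from the text)
WAYS = 2             # two-way set associative (from the text)
BLOCK_BYTES = 8      # block size of one quadword (from the text)
PAGE_BYTES = 512     # VAX memory page size


def cache_geometry(cache_bytes, ways, block_bytes):
    """Derive the per-way size and the address bits used for look-up."""
    bytes_per_way = cache_bytes // ways
    rows = bytes_per_way // block_bytes
    return {
        "bytes_per_way": bytes_per_way,
        "rows": rows,
        "offset_bits": block_bytes.bit_length() - 1,  # log2(block size)
        "index_bits": rows.bit_length() - 1,          # log2(rows)
    }


def index_within_page_offset(geometry, page_bytes):
    """True when the look-up uses only untranslated page-offset bits."""
    page_offset_bits = page_bytes.bit_length() - 1
    return geometry["offset_bits"] + geometry["index_bits"] <= page_offset_bits


geometry = cache_geometry(CACHE_BYTES, WAYS, BLOCK_BYTES)
print(geometry)
print("look-up can overlap translation:",
      index_within_page_offset(geometry, PAGE_BYTES))
```

Because the nine low-order address bits that select a cache row and byte are the same in the virtual and physical address, the row can be read while the translation buffer produces the physical page number needed for the tag comparison.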

The first-level cache is look-through; that is, 
cache hits on read cycles result in no activity on 
the CDAL bus, thus preserving its bandwidth for 
DMA transfers. The block size is one quadword so 
that cache misses on cacheable read cycles cause 
the CVAX to generate a quadword transfer on the 
CDAL bus. This transfer results in two longwords 
of data being returned in response to a single 
address. The minimum transfer time is two 
microcycles for the first longword and one for the 
second, which increases the effective CDAL bus 
bandwidth. Further, the first-level cache is
write-through. However, to improve performance, the
CVAX also contains a longword write buffer,
which allows the CPU to execute out of the
first-level cache while the write operation is being
completed. 3



The Floating Point Accelerator 

The CFPA chip works in conjunction with the
CVAX chip to process floating point instructions
and to accelerate the execution of some integer
instructions (MULL, DIVL, and EMUL). The CVAX
decodes the instructions and sends the CFPA
control and opcode information by means of a
dedicated eight-line control bus. The CFPA gets
its operands from the CDAL bus. Unlike the
MicroVAX II, not all operands have to come
from the CPU. Operands come from the CVAX
only if they reside in the general-purpose registers
or first-level cache. If the operands reside
in the second-level cache or main memory, the
CFPA takes them directly off the CDAL bus. When
the CFPA has completed the operation, it returns
condition codes and exception status by means of
the control bus, and the unaligned result by the
CDAL bus. One, two, or three longword transfers
may be required to transfer the result, depending
on the type of operation. The CVAX aligns and
sends the result to its ultimate destination. To
improve DMA latency, the CVAX will grant the
CDAL bus requests while waiting for the CFPA to
return the result. 4

The Second-level Cache

The second-level cache sits directly on the CDAL
bus and bridges the 4-microcycle gap in access
time between the first-level cache and main
memory. The project goal for the second-level
cache was to maximize system performance
without placing the schedule at risk. Consequently,
designers chose to use large storage
arrays to achieve the desired level of performance
(hit rate) rather than complex control logic. By
keeping the control logic simple, the cache
could be implemented in PALs rather than custom
VLSI. Thus the chance of design errors was
reduced as well as the time needed to correct any
errors found during design qualification.

The large storage arrays were easily implemented
using off-the-shelf static RAMs. The
resulting design was a 64KB, direct-mapped,
physical instruction and data cache with
write-through. The implementation called for six PALs
for control logic, eight 16K-by-4 static RAMs and
four 16K-by-1 static RAMs for the data store, and
three 16K-by-4 static RAMs for the tag store.
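
The RAM counts can be checked against the stated cache size. The following sketch adds up the widths of the parts listed above. It assumes that the four 16K-by-1 parts hold one check bit per data byte, which the text does not state; the function name is illustrative.

```python
K = 1024


def array_shape(chips):
    """Combine RAM parts placed side by side.

    chips is a list of (count, depth, width_bits) tuples. All parts
    must have the same depth. Returns (depth, total_width_bits).
    """
    depths = {depth for _, depth, _ in chips}
    if len(depths) != 1:
        raise ValueError("parts placed side by side must share one depth")
    total_width = sum(count * width for count, _, width in chips)
    return depths.pop(), total_width


# Data store: eight 16K-by-4 parts plus four 16K-by-1 parts.
data_depth, data_width = array_shape([(8, 16 * K, 4), (4, 16 * K, 1)])

# Tag store: three 16K-by-4 parts.
tag_depth, tag_width = array_shape([(3, 16 * K, 4)])

# Assumption: 32 of the data-store bits are data and the other 4
# are check bits, one per byte.
data_bytes = data_depth * 32 // 8

print(f"data store: {data_depth} entries x {data_width} bits")
print(f"tag store:  {tag_depth} entries x {tag_width} bits")
print(f"cached data: {data_bytes // K}KB")
```

Sixteen thousand longword entries give the 64KB capacity stated above, with a 12-bit tag store entry for each longword.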

In keeping with the philosophy of simple control
logic, the second-level cache is look-aside;
that is, address decoding occurs in parallel in
the cache controller and the memory controller.
Therefore, the cache does not have to regenerate
CDAL bus cycles in the event of a cache miss. The
second-level cache control logic simply
■ Watches the CDAL bus cycles 

■ Returns data to the CVAX on cacheable read 
cycles that miss the first-level cache but hit the 
second-level cache 

■ Allocates a block on cacheable quadword 
CVAX read cycles that miss both caches 

■ Updates an entry on CVAX write cycles that hit
the second-level cache

■ Invalidates a block on DMA write cycles that
hit the second-level cache

■ Ignores DMA read cycles 

Because the second-level cache stores the same 
types of references as the first-level cache, very 
little control logic is required to determine 
which CVAX references are cacheable. The CVAX 
will only generate quadword CDAL bus cycles on 
cacheable CPU references that miss the first-level 
cache. Therefore, the second-level cache control 
logic only considers quadword read cycles
cacheable.
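
The rules listed above are simple enough to write as a single decision function. The sketch below is a hypothetical software model of the behavior described, not the PAL equations. It assumes the function sees only cycles that appear on the CDAL bus (first-level cache hits on reads never do), and it assumes that cases not covered by the list, such as a CVAX write that misses the second-level cache, cause no action.

```python
from enum import Enum


class Action(Enum):
    RETURN_DATA = "return data to the CVAX"
    ALLOCATE = "allocate a block"
    UPDATE = "update the entry"
    INVALIDATE = "invalidate the block"
    IGNORE = "no action"


def second_level_action(master, is_write, is_quadword, l2_hit):
    """Action of the second-level cache for one CDAL bus cycle.

    master is "CVAX" or "DMA". A CVAX read is treated as cacheable
    only when it is a quadword cycle, as described in the text.
    """
    if master == "DMA":
        # DMA writes invalidate on a hit; DMA reads are ignored.
        return Action.INVALIDATE if (is_write and l2_hit) else Action.IGNORE
    if is_write:
        # Assumption: a write miss does not allocate a block.
        return Action.UPDATE if l2_hit else Action.IGNORE
    if not is_quadword:
        return Action.IGNORE  # not a cacheable read
    return Action.RETURN_DATA if l2_hit else Action.ALLOCATE


for cycle in [("CVAX", False, True, True),
              ("CVAX", False, True, False),
              ("CVAX", True, False, True),
              ("DMA", True, False, True),
              ("DMA", False, False, True)]:
    print(cycle, "->", second_level_action(*cycle).value)
```

Keeping the decision to a handful of inputs is what allows the logic to fit in a small number of PALs.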

To respond within the minimum CVAX bus 
cycle time (one microcycle for the second long- 
word of a quadword cycle), the second-level 
cache control logic uses an overlap scheme. The 
second-level cache overlaps the address genera- 
tion and the tag look-up for the second longword 
portion of the cycle with the data access for the 
first longword portion of the cycle. 5 

The Memory Controller 

The CMCTL chip is the interface between the 
CDAL bus and the memory array. The chip is a
full 32-bit, single-ported, synchronous memory
controller with 7-bit error-correcting code
(ECC) and supports up to four memory array
modules (two more than the MicroVAX II).

The CMCTL longword write buffer minimizes
the effect of write operations on CPU performance.
(Both caches are write-through.) The
CMCTL also supports multiword transfers on the 
CDAL bus. On these transfers, the CMCTL utilizes 
page mode in the dynamic RAMs to achieve the 
performance of an eight-way interleaved memory 
subsystem without the use of additional banks or 
interconnect complexity. The size of the transfer 
is encoded in bits 31 through 30 of the physical 
address (up to four longwords). Thus with only a 
single address, the memory controller can fetch
sequential longwords in less time. Both the CVAX
and the CQBIC utilize this feature to improve 
performance. The CVAX generates quadword 
transfers to fill cache blocks on a cache miss; and 
the CQBIC generates quadword, hexaword, or 
octaword transfers on block-mode DMA by 
devices on the Q22-bus. The combination of mul- 
tiword transfers and the look-through first-level 
cache made the added complexity of dual ports 
(as used in the MicroVAX II) unnecessary. To 
work effectively with the look-aside second-level 
cache, the CMCTL must monitor the CDAL bus 
after starting a memory operation. If the second-level
cache responds with the data first, the
CMCTL aborts its operation before completion.

To support a range of CVAX microcycle times 
and also maintain the performance advantage of 
synchronous operation, the CMCTL includes a 
programmable wait-state bit. This bit controls the 
number of CPU microcycles used to access the 
RAM array. Moreover, this bit allows the same
array modules to be used for processors with 
different microcycle times. 6 

The memory controller was not designed to 
support battery back-up because of the added 
design complexity and cost. For those applica- 
tions that require support during power outages, 
standby uninterruptible power supplies are a
better solution and are available for small systems
at low cost. 

The Q22-bus Interface 

The CQBIC interfaces the CDAL bus to the
Q22-bus. This chip provides address translation
between the 26-bit CDAL bus and 22-bit
Q22-bus. In addition, CQBIC handles data
buffering between the 32-bit synchronous/asynchronous
CDAL bus and the 16-bit asynchronous
Q22-bus. Q22-bus addresses are translated to 
CDAL bus addresses by a programmable mapping 
function (scatter-gather map), which is software 
compatible with the MicroVAX II system. This 
function gives the CPU the capability to map any 
page of the 4 megabyte (MB) Q22-bus address 
space to any page of the main memory address 
space. Thus Q22-bus DMA devices can transfer 
directly to or from discontiguous pages of main 
memory. CDAL bus addresses are translated into 
Q22-bus addresses by a direct mapping function. 
This function maps the 4MB Q22-bus memory 
space and the 8KB Q22-bus I/O space into the 
VAX I/O space. Thus the CPU can directly access 
Q22-bus memory or device registers by means of 
two ranges of I/O page addresses.

DMA write references are buffered in two naturally
aligned octaword buffers and transferred
to main memory by the most efficient combina- 
tion of multiword transfers. The two octaword 
buffers allow an entire block-mode transfer (up 
to 16 words) to be buffered by the CQBIC. After 
the first buffer has been filled by the Q22-bus 
device, it is emptied into main memory while the 
Q22-bus device fills the second buffer. Since the 
CDAL bus is faster than the Q22-bus, the first 
buffer is emptied and ready for input from the 
Q22-bus device before the second buffer has been
filled. This arrangement allows the interface to
provide sustained throughput at maximum
Q22-bus transfer rates with no additional latency.

Q22-bus block-mode DMA read references are
translated into quadword transfers on the CDAL 
bus. The four words are buffered in a single quad- 
word buffer and supplied to the DMA device on 
demand. Before the buffer is emptied, the next 
quadword is prefetched. This prefetch elimi- 
nates additional latency on all but the first trans- 
fer. To keep the latency of the first transfer at a 
minimum, the CQBIC responds to the DMA 
device after receiving the first longword of a 
quadword CDAL bus cycle, rather than waiting 
for the entire quadword transfer to complete. 

To fit the entire Q22-bus interface in a single 
chip, some changes had to be made to the bus 
interface architecture of the MicroVAX II system. 
On the MicroVAX II, the scatter-gather map was 
stored in a dedicated 32KB static RAM array 
within the bus interface. On the CQBIC, not 
enough space was available to implement this 
storage array internal to the chip. Moreover, not 
enough pins were available to provide a dedi- 
cated bus to an external static RAM array. The 
solution was to store the scatter-gather map in a 
32KB block of main memory and to implement a 
16-entry fully associative cache for map entries 
in the CQBIC. The cache functions in the same
manner as an address translation buffer. When
translating a Q22-bus address, the cache is
checked for the appropriate map entry. If the 
entry is found, the translation takes place at maxi- 
mum speed. If the entry is not found, then there 
is a delay while the entry is fetched from main 
memory. The translation is then performed. This 
delay is eliminated on DMA transfers that cross a 
page boundary, because the entry that maps the 
next page is prefetched when the DMA operation 
reaches a page boundary. On most DMA transfers, 
this delay is negligible because it is amortized 
over a large number of Q22-bus transfers. The
design ensures that the operating system does
not attempt to use the block of memory where
the scatter-gather map resides. The on-board
firmware does not include these pages in a list of
good memory pages that is passed to the operating
system at boot time. An interesting side effect
of putting the scatter-gather map in main memory
was that the relatively long latency on some
Q22-bus DMA cycles uncovered latent design
bugs in several Q22-bus DMA devices. The
designs of these devices had been verified by 
empirical testing with existing processors rather 
than by testing to the Q22-bus specification. 
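
The map-entry cache described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the CQBIC design: the least-recently-used replacement policy, the class and attribute names, and the representation of a map entry as a bare main-memory page number are all hypothetical, and valid bits and the page-boundary prefetch are left out. The 512-byte page is the VAX page size, which is consistent with the sizes in the text: 4MB of Q22-bus space divided into 512-byte pages gives 8192 longword entries, or 32KB.

```python
from collections import OrderedDict

PAGE_BYTES = 512                          # VAX page size
Q22_SPACE_BYTES = 4 * 1024 * 1024         # 4MB Q22-bus address space
MAP_ENTRIES = Q22_SPACE_BYTES // PAGE_BYTES
CACHE_ENTRIES = 16                        # fully associative map cache


class MapCache:
    """Translate Q22-bus addresses through a cached scatter-gather map."""

    def __init__(self, map_in_memory):
        self.map_in_memory = map_in_memory   # Q22 page -> main memory page
        self.cache = OrderedDict()           # any entry may sit in any slot
        self.fetches = 0                     # slow-path reads of main memory

    def _entry(self, q22_page):
        if q22_page in self.cache:
            self.cache.move_to_end(q22_page)    # mark as most recently used
        else:
            self.fetches += 1                   # miss: fetch entry from memory
            if len(self.cache) >= CACHE_ENTRIES:
                self.cache.popitem(last=False)  # evict least recently used
            self.cache[q22_page] = self.map_in_memory[q22_page]
        return self.cache[q22_page]

    def translate(self, q22_addr):
        page, offset = divmod(q22_addr, PAGE_BYTES)
        return self._entry(page) * PAGE_BYTES + offset
```

Counting `fetches` over a long sequential transfer shows why the miss penalty is amortized: each page costs at most one fetch, while it supplies translations for every transfer within that page.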

To maintain software compatibility with the 
MicroVAX II system, the scatter-gather map is ref- 
erenced through a 32KB block of I/O space 
addresses. The CQBIC responds to writes in this 
address range by buffering the data so the CVAX 
cycle can complete, updating the cache if there 
is a hit, requesting the CDAL bus, and updating 
the entry in main memory. If any DMA operations 
are pending, they are completed before CQBIC 
gives up the CDAL bus. This prevents multiple 
successive map updates by the CPU from locking 
out DMA activity long enough to cause Q22-bus
devices to time out (in 10 microseconds).

On reads to this address range that miss the 
cache, the CQBIC has to latch the address and 
force the CVAX to retry the cycle. In this way, 
CQBIC can acquire the CDAL bus to fetch the 
entry from main memory. When the CQBIC relin- 
quishes the CDAL bus, the CVAX retries the cycle, 
and the CQBIC provides the processor with the 
requested map entry. This retry mechanism is 
also used to implement the interlocked instruc- 
tions in the VAX instruction set. 

On all interlocked instructions, the CVAX gen- 
erates one or more sequences of a read-lock cycle 
followed immediately by a write unlock cycle. 
The CVAX identifies these special locked cycles 
by placing a unique code on the parity lines at 
address time. The CQBIC recognizes the read- 
lock code and forces the CVAX to retry until the 
CQBIC can become master of the Q22-bus. Once
the CQBIC has mastership of the Q22-bus, mem- 
ory is effectively locked and the cycle proceeds. 
The CQBIC releases the Q22-bus (unlocking 
memory) on the next CVAX bus transaction even 
if it is not a write unlock cycle. This release pre- 
vents memory from staying locked if the CVAX 
has to abort the instruction due to an error en- 
countered on the read-lock cycle. 

Like the MicroVAX II Q22-bus interface, the 
CQBIC gives the CPU the highest rather than the
lowest priority when arbitrating the Q22-bus.
This priority assignment reduces interrupt 
latency, since the processor is delayed for a maxi- 
mum of one DMA transaction before being 
granted the bus to acknowledge the interrupt. 
Because the CPU accesses memory over a dedi- 
cated interconnect rather than through the 
Q22-bus, CPU references to the Q22-bus are very 
infrequent. Therefore this priority scheme does 
not have a negative impact on DMA performance. 

To support a range of CVAX microcycle times 
and fixed Q22-bus timing, the CQBIC was 
designed to run at a fixed clock rate, asynchronously
to the CPU/memory subsystem. This design
made it easier for engineers to optimize performance
of the slower asynchronous Q22-bus
(where bandwidth is at a premium). These optimizations
are made at the expense of lower performance
on the faster CDAL bus (where there is
extra bandwidth) due to synchronization delays.

System Support Functions 

The SSC contains all those functions required to 
support the on-board firmware, the time-of-year 
clock, and the console serial line. The chip pro- 
vides the logic necessary to interface the two 
64KB read-only memories (ROMs) containing the 
firmware with the BCDAL bus. Since the ROMs are
organized as a 64K-by-16-bit array, the SSC must
generate two ROM cycles to satisfy each 32-bit
CDAL bus cycle. This ROM unpacking function
saves board space as well as the costs related to a 
32-bit-wide ROM array. 
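
The ROM unpacking function amounts to assembling one longword from two consecutive 16-bit reads. The Python sketch below illustrates the idea; the function name is hypothetical, and the ordering of the two halves (low word first, following the VAX little-endian convention) is an assumption, since the text does not state it.

```python
ROM_WORDS = 64 * 1024    # the ROM array is organized as 64K by 16 bits


def read_rom_longword(rom_words, longword_index):
    """Build one 32-bit value from two consecutive 16-bit ROM cycles."""
    low = rom_words[2 * longword_index]        # first ROM cycle
    high = rom_words[2 * longword_index + 1]   # second ROM cycle
    return (high << 16) | low                  # assumed low-word-first order
```

For example, a ROM whose first two words are 0x5678 and 0x1234 would return 0x12345678 for longword 0 under this ordering.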

The SSC assists in the firmware emulation of a
VAX console processor by providing two address
spaces through which the ROM may be
accessed: the halt-mode ROM space and the
run-mode ROM space. Any I-stream read from the
halt-mode ROM space protects the processor
from external halt conditions and extinguishes
the front panel run light. Any I-stream read outside
the halt-mode ROM space, including reads
from the run-mode ROM space, enables external
halt conditions. Under this condition, the front
panel run light is illuminated. The firmware is
organized so that console emulation code is executed
from the halt-mode ROM space, and diagnostics
and boot code are executed out of the
run-mode ROM space. The SSC also provides the
firmware with 1KB of battery-backed-up RAM for
storage of data structures and stack space, and a
register for controlling four diagnostic LEDs.

The SSC also contains a VAX standard console
serial line and a VAX standard battery-backed-up
time-of-year clock. (The VAX standard serial line
replaces the serial line chip used as the console
on the MicroVAX II. The clock replaces the
off-the-shelf clock chip.) Since the console control/status
registers (RXCS and TXCS), console
data buffers (RXDB and TXDB), and the time-of-year
clock (TODR) are VAX internal processor
registers, they are accessed by means of special
CDAL bus cycles as described in the section The
Central Processing Unit and First-Level Cache. 1

To save board space and cost, the SSC provides
two programmable address strobes for decoding
additional board-level registers. These address
strobes decode the second-level cache control
register (CACR) and the MicroVAX II-compatible
boot and diagnostic register (BDR). 2

To prevent the processor from "hanging" on
unanswered CDAL bus cycles, the SSC provides a
programmable watchdog timer for the CDAL
bus. The timer starts at the beginning of a
CDAL bus cycle. If the timer expires before the
cycle completes, the SSC asserts the error line,
causing the CQBIC or CVAX to abort the cycle.
This timer could not be used for all CDAL bus
cycles. To do so, the timer would have to be set to
a value greater than the Q22-bus timeout value
(10 microseconds) so that CPU accesses to the
Q22-bus would not be timed out prematurely.
Moreover, the timer would have to be set to a
value much less than the Q22-bus timeout value
so that unanswered CDAL bus cycles would not
cause Q22-bus timeouts during DMA. Since
the CQBIC contains a 10-microsecond Q22-bus
watchdog timer, the CDAL bus timer was set to
2 microseconds (greater than the longest CDAL
bus cycle) and disabled on all Q22-bus references.
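
The conflict between the two requirements, and the way two separate timers resolve it, can be shown with a short Python sketch. The function names are hypothetical; the 10-microsecond and 2-microsecond values come from the text, and "much less than" is relaxed here to a simple "less than" for the purpose of the check.

```python
Q22_TIMEOUT_US = 10.0    # Q22-bus timeout, also the CQBIC watchdog value
CDAL_TIMER_US = 2.0      # CDAL bus watchdog setting in the SSC


def single_timer_would_work(timeout_us):
    """Test both requirements that one timer for all cycles must meet."""
    spares_cpu_q22_accesses = timeout_us > Q22_TIMEOUT_US
    protects_dma = timeout_us < Q22_TIMEOUT_US
    return spares_cpu_q22_accesses and protects_dma


def cycle_times_out(duration_us, is_q22_reference):
    """Apply whichever timer covers this kind of bus cycle."""
    limit = Q22_TIMEOUT_US if is_q22_reference else CDAL_TIMER_US
    return duration_us > limit
```

No value can satisfy `single_timer_would_work`, which is why the CDAL bus timer is disabled on Q22-bus references and those cycles are left to the CQBIC timer.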

To support a range of CVAX microcycle times, 
the SSC was designed to run at a fixed clock rate, 
asynchronously to the CPU/memory subsystem. 
Since the performance of the functions in the SSC 
was not critical, the performance impact was not 
a concern. 8 

Hardware Interrupts 

The interrupt logic is spread among three chips: 
CVAX, SSC, and CQBIC. The CVAX provides four 
interrupt request pins that correspond to stan- 
dard VAX hardware interrupt request levels 
14 through 17. The CVAX does not provide an
interrupt-acknowledge pin. The CVAX acknowledges
interrupts when the processor's priority
level is below the interrupt level by generating
an interrupt-acknowledge cycle on the CDAL bus.


84 



Digital Technical Journal 

No. 7 August 1988



CVAX-based 
Systems 



The "address" used is the level of the interrupt 
request being serviced. The data read is the offset 
of the vector within the system control block. 

The SSC contains the interrupt-acknowledge 
pin. The SSC responds to interrupt-acknowledge 
cycles whenever it has an interrupt pending at 
the level being acknowledged. If the SSC does not
have an interrupt pending at that level, it asserts 
the interrupt-acknowledge signal. The CQBIC 
passes interrupt-acknowledge cycles on to the 
Q22-bus only when the SSC asserts the interrupt- 
acknowledge signal. This interrupt-acknowledge 
scheme saves a CVAX pin, at the expense of 
requiring the devices in the SSC to have the 
highest interrupt priority at their level (IRQ 14).

The CQBIC uses all four CVAX interrupt 
request lines to support the four Q22-bus inter- 
rupt request levels. (BR4 through BR7 are con- 
nected to the pins corresponding to IRQ levels 
14 through 17.) Since the Q22-bus has only 
one interrupt-acknowledge line, it is possible 
for a level 7 (17) device to steal an interrupt- 
acknowledge cycle intended for a level 4 (14) 
device. (This "steal" can occur if the level 7 
device is closer to the processor and posts an 
interrupt after the level 4 interrupt was acknowl- 
edged but before the acknowledgment reached 
it.) To prevent this situation from causing a level
7 (17) device driver to run at a lower IPL,
the CQBIC sets a bit that is returned along with
the vector offset. This bit causes the CVAX to set
the processor IPL to 17 before passing control to
the driver. If the bit is not set, the processor IPL is
set to the level at which the interrupt request
was received. The CQBIC also adds an offset of
200 (hex) to the vector returned by the Q22-bus
device so there is no conflict with existing VAX
system control block entries.
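
The vector handling described above can be sketched as follows. This Python model is an illustration under stated assumptions: the function names are hypothetical, the position of the forcing bit (bit 0 here) is assumed because the text does not give it, and the levels 14 through 17 are read as hexadecimal IPL values.

```python
VECTOR_OFFSET = 0x200          # offset added by the CQBIC (from the text)
FORCE_IPL17_BIT = 0x1          # hypothetical position of the forcing bit
BR_TO_IPL = {4: 0x14, 5: 0x15, 6: 0x16, 7: 0x17}


def cqbic_vector(device_vector, force_ipl17):
    """Value the CQBIC returns for an interrupt-acknowledge cycle."""
    value = device_vector + VECTOR_OFFSET
    return value | FORCE_IPL17_BIT if force_ipl17 else value


def cvax_dispatch(acknowledged_br, returned_value):
    """Return (processor IPL, system control block offset) for the driver."""
    if returned_value & FORCE_IPL17_BIT:
        ipl = 0x17                              # guard against a stolen cycle
    else:
        ipl = BR_TO_IPL[acknowledged_br]        # level of the request
    return ipl, returned_value & ~FORCE_IPL17_BIT
```

With the bit set, a driver entered through a level 4 acknowledge still starts at IPL 17, so a level 7 device that stole the cycle is never serviced at too low a priority.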

Performance Relative to the 
MicroVAX II Processor Module 

The reduction in gate delays due to the new chip 
technology allowed the processor microcycle 
time to be reduced to 90 ns (versus 200 ns for 
MicroVAX II) and the minimum bus cycle time 
to be reduced to 180 ns (versus 400 ns for
MicroVAX II). The increase in the number of tran- 
sistors made available by the new technology 
allowed the following architectural mechanisms 
to be used to increase performance: 

■ A larger prefetch buffer (12 versus 8 bytes)

■ A larger translation buffer (28 versus 8 entries) 



■ A 1KB, 90-ns, first-level cache

■ A 64KB, 180-ns, second-level cache (instead 
of 1MB of memory)

■ Multiword transfers (longword, quadword, 
hexaword, and octaword versus longword) 

■ CPU write buffers (one longword) in the CPU, 
memory controller and Q22-bus interface 

■ Larger DMA buffers (16 words versus 2 words
for writes, 4 words versus 2 words for reads) 

■ A 16-entry scatter-gather map cache

The combination of reduced cycle times and 
architectural mechanisms produced a CPU performance
3.2 times that of the MicroVAX II (as
measured by the mean of the distribution of 
results from over 150 CPU benchmarks). Addi- 
tionally, a slight increase in maximum I/O band- 
width was achieved (as measured by simulation 
with an ideal Q-bus master). 
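
As a rough check on how much of the gain comes from cycle time alone, the ratios of the cycle times quoted in this section can be computed directly. The Python sketch below uses only those figures; the function name is hypothetical.

```python
def speedup(old_ns, new_ns):
    """Ratio of an old cycle time to a new one."""
    return old_ns / new_ns


microcycle_speedup = speedup(200, 90)    # processor microcycle time
bus_cycle_speedup = speedup(400, 180)    # minimum bus cycle time
```

Both ratios come to a little over 2.2, so the remainder of the measured improvement has to come from the architectural mechanisms listed above.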

Reliability 

Both the MicroVAX II design and the MicroVAX
3500/3600 design were subjected to extensive 
thermal analysis. This analysis contributed to a 
board layout and chip packaging scheme that 
would minimize junction temperatures, thereby 
improving reliability. Both designs also ensure a 
high level of reliability by using preconditioned 
components that have passed a rigorous quali- 
fication program. 

Because of its increased complexity, the 
MicroVAX 3500/3600 was designed to be more 
tolerant of intermittent and transient failure 
mechanisms. ECC rather than parity is used to 
protect main memory, and the data path between 
the CPU and main memory (including both 
caches) is protected by byte parity. There are 
also four timers (three for the Q22-bus and one 
for the CDAL bus) to detect unanswered bus 
cycles. The CVAX can detect four types of CFPA 
errors, four types of memory management unit 
errors, one type of interrupt error and one type of 
microcode error. Errors that are detected syn- 
chronous to CPU execution are reported by 
means of a machine check on the same cycle on 
which the errors are detected. (Comparatively, 
the MicroVAX II reports the errors on the subse- 
quent cycle.) Unique machine check frames or 
hardware error flags are provided so that the
proper error recovery routine can be invoked.
The recovery routines typically log the error,
clear the error condition, retry the operation a
specified number of times, and continue if successful.
If the routine is unsuccessful and the
faulty hardware can be disabled, the system runs
in a degraded mode until repaired. Otherwise,
the system will crash. Errors detected asynchronously
to CPU execution are reported by a
high priority interrupt and are logged, but in
most cases are nonrecoverable. Errors that are
corrected by hardware are reported via a lower
priority interrupt, so they can be logged.
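
The recovery flow for synchronous errors can be summarized in a few lines of Python. This is a sketch of the sequence described above, not operating system code: the function name, its parameters, and the returned labels are hypothetical.

```python
def handle_synchronous_error(retry, clear_error, log, max_retries, can_disable):
    """Log, clear, and retry; fall back to degraded mode or a crash."""
    log("machine check")
    for _ in range(max_retries):
        clear_error()              # clear the error condition
        if retry():                # retry the failed operation
            return "continue"
    # All retries failed: disable the faulty hardware if that is possible.
    return "degraded" if can_disable else "crash"
```

Here `retry` is any callable that returns True when the operation succeeds, and `max_retries` stands in for the "specified number of times" mentioned in the text.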

Data from reliability qualification testing 
verified that the predominant failure mode was 
intermittent, suggesting that the error recovery
capabilities built into the system would signifi- 
cantly increase the uptime of the system. 

Testability 

Most of the architectural mechanisms used
to increase the speed of computer systems (such
as caches and special purpose buffers) present
testability problems. These mechanisms are
almost always designed to be software transparent,
which makes them invisible to diagnostic
software. To solve this problem, special diagnostic
modes are provided for both the first- and
second-level caches. The first-level cache diagnostic
mode provides a way for the CPU to explicitly
write the tag store and clear the valid bits by
using selected instructions. The second-level
cache diagnostic mode provides explicit access
to both the tag and data stores through two
blocks of I/O addresses (the cache diagnostic
space and the cache tag diagnostic space).
Through the cache diagnostic space, the data
store can be read or written, the tag store can
be written, and the valid bits can be cleared.
When not in diagnostic mode, cache appears in
this space as high speed RAM. During power-up
self-test, diagnostic code is transferred from ROM
to this RAM to allow fast execution of the code
without requiring that main memory be functional.
Through the cache tag diagnostic space,
the state of the cache tag bits, parity bits, valid
bits, and several points within the cache control
logic can be read.

The MicroVAX 3500/3600 processor module
design also provides a diagnostic mode for main
memory and a means of writing to main memory
through the Q22-bus interface. The main memory
diagnostic mode allows memory test times to
be significantly reduced. Further, writing to main
memory through the Q22-bus interface allows
the scatter-gather map functionality to be tested
without the assistance of another device on the
Q22-bus. 2

Summary 

Having met performance goals, MicroVAX 3500/ 
3600 systems were shipping in volume within 
three years of the first shipments of MicroVAX II. 
At that time, two system packages, over twenty 
mass storage and communications options, three
operating systems, and over 200 software products
(for VMS alone) had been qualified and were
available from Digital. Scores of hardware and 
software products were also available from third- 
party vendors. This offering would never have 
been possible without the level of compatibility 
that results from strict adherence to existing CPU 
(VAX) and I/O bus (Q22-bus) specifications.

Design of the MicroVAX 3500/3600
Second-level Cache

The MicroVAX 3500/3600 processor module, the KA650, is a CVAX-based 
uniprocessor that incorporates an unusual cache architecture: a two-level 
cache. The first level is a small fast cache on the CPU chip, and the second 
level is a large, somewhat slower cache on the processor module. Along 
with high quality and high performance, time-to-market was a crucial goal
for this third-generation MicroVAX system product. Consequently, project 
engineers adhered to a philosophy of design simplicity for the second-level 
cache. Cache performance measurements support their design decisions. 



The MicroVAX 3500/3600 Project

The primary goal of the MicroVAX 3500/3600 
project was simple. The chip designers in the 
Semiconductor Engineering Group (SEG) were 
working on a new single-chip VAX, CVAX. 1 The
chip would have its own on-chip cache and was 
projected to achieve a performance level three 
times the original MicroVAX chip used in the 
MicroVAX II system. The MicroVAX Development 
Group would work in concert with the SEG effort. 
Our goal was to ship a high-quality, high-perfor- 
mance CVAX-based uniprocessor, which would 
be upward compatible with MicroVAX II systems. 
This new product must be available as soon as 
CVAX chips could be produced in volume. 

Given the objectives of high quality and
MicroVAX II system compatibility, the remaining 
design goals were carefully prioritized as listed 
below: 

1. Time to market

2. Raw computational performance

3. Memory expansion

4. Direct-memory access (DMA)/real-time performance

5. System cost and price

6. Additional functionality

The importance of quickly delivering the 
MicroVAX 3500/3600 to market led to a close 
working relationship between the engineers in 
SEG and MicroVAX Development. We designed
and built the MicroVAX CPU and memory modules
in parallel with the CVAX project, a process
that relied heavily on simulation. In turn, the 
MicroVAX project team provided the initial 
debug testbed for CVAX: CVAX first booted VMS 
in a MicroVAX 3500/3600 system. 

Overview of the
KA650 Processor Module

The system functional partition (Figure 1) shows
how the KA650 processor module fits into the 
entire computer system. The processor module 
communicates to mass storage, communication, 
and other I/O devices over the Q22-bus. Main 
memory connects to the processor on a private 
memory bus which uses both the backplane and
"over-the-top" ribbon cable. A console panel carries
bit rate and configuration switches, a single-digit
hexadecimal display, a connector for the
console serial line, and a NiCd battery for the 
processor's time-of-year (TOY) clock. 

The module functional partition in Figure 2 
shows the basic parts of the KA650 processor 
module. All memory traffic flows over the CDAL 
bus (CVAX data/address lines). Only I/O space
registers reside on the BCDAL bus (buffered 
CVAX data/address lines). 

The memory controller subsystem and the 
Q22-bus interface subsystem are each single 
chips, the CMCTL (CVAX memory controller) 
and CQBIC (CVAX Q22-bus interface chip). 2 * 
Most of the system support functions are con- 
tained in another chip, the SSC (system support 
chip). Each of these was designed in parallel
with CVAX, as part of a complete CVAX chip set.

Figure 1 MicroVAX 3500/3600 System Functional Partition



The primary problem left to the KA650 module
designers was to balance two key goals: to
design the board-level cache for the highest performance
possible and to do so without endangering
the project's time-to-market goal.

Two-level Cache Architecture
Description 

The KA650 is Digital's first commercially available
processor to incorporate a two-level cache.
The first level is a small cache on the CPU
chip with a cycle time of one microcycle, or
90 nanoseconds (ns). The second level is a large
cache on the processor module with a cycle time
of two microcycles, or 180 ns. In comparison,
the cycle time of the main memory system is five
microcycles, or 450 ns.

The goal of each level of cache is to reduce 
effective memory access time on processor read 
cycles. At the chip level, the CVAX processor 
would prefer to use just one microcycle to access 
memory. However, the CVAX bus interface unit 
(BIU) requires two microcycles to access memory
off the chip. To compensate for this gap, the
CVAX designers included an on-chip cache that
could be accessed in one microcycle, and made 
the cache as large as practical. From the module 
perspective, CVAX can run a bus cycle as quickly 
as two microcycles. However, the memory system 
requires five microcycles to access main memory. 
To compensate for this second gap, the module 
designers included a module level cache that 
could be accessed in two microcycles, and made 
the cache as large as practical.

First-level Cache

The first-level cache is a 1 kilobyte (KB), two-way
set associative cache with a quadword block size.
The cache is organized as 64 rows, each row containing
two sets, and each set containing 8 bytes.
Two bits in the cache disable register (CADR)
select whether the first-level cache stores 
I-stream only, D-stream only (ordinarily used 
only for diagnostics), or both I-stream and 
D-stream references. The cache allocates a block 
whenever a cacheable read reference misses the 
cache. The CVAX BIU then generates a quadword 
read cycle to fill the allocated block.

The CVAX BIU waits to determine whether a
read reference hits in the cache before starting
the bus cycle to access memory. This wait helps
free the processor bus for use by DMA devices, but
requires faster RAMs in the second-level cache.

The processor writes directly through the 
cache to memory. Therefore, when a cache block 
is allocated, the block being replaced need not 
be written back to memory. The CVAX BIU also 
incorporates a write buffer to support dump-and-run
writes by the processor. If the CDAL bus
is busy when CVAX needs to write, the BIU will 
buffer one write cycle. The buffering allows the 
processor to continue execution, reading from 
the first-level cache. Thus, some write cycles
require only one microcycle. 

When DMA devices write to main memory, the
cache must be updated to reflect the change in
main memory. Cache data that is no longer consistent
with the contents of main memory is
called stale data. To prevent stale data from accumulating
in the cache when DMA devices write to
memory, the cache will check and invalidate one
or two blocks as necessary. Invalidation ties up
the first-level cache for three microcycles per
quadword block and six microcycles for an octaword.
However, these delays stall CPU execution
only if the CPU requires access to the cache during
those microcycles.

Second-level Cache

The second-level cache is a 64KB direct-mapped
cache, which, like the first level, also has a quadword
block size. This cache is organized as
8K rows, each row containing one set of 8 bytes. 
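
The geometry of the two caches can be checked with simple arithmetic, and the same figures determine how an address selects a row. The Python sketch below is illustrative: the function and dictionary names are hypothetical, and it assumes the usual arrangement in which the low address bits select the byte within a block, the next bits select the row, and the remaining high bits form the tag.

```python
FIRST_LEVEL = dict(rows=64, ways=2, block_bytes=8)           # two-way, 1KB
SECOND_LEVEL = dict(rows=8 * 1024, ways=1, block_bytes=8)    # direct-mapped, 64KB


def cache_bytes(rows, ways, block_bytes):
    """Total data capacity implied by the organization."""
    return rows * ways * block_bytes


def split_address(addr, rows, block_bytes=8):
    """Return (tag, row, byte offset) for a byte address."""
    block, offset = divmod(addr, block_bytes)
    tag, row = divmod(block, rows)
    return tag, row, offset
```

In the direct-mapped second-level cache a row holds a single block, so one tag comparison decides hit or miss; in the two-way first-level cache the same row number selects two candidate blocks whose tags must both be compared.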

The second-level cache allocates a quadword
block whenever CVAX reads a quadword that 
misses the second-level cache. (Quadword reads 
are ordinarily the result of allocation in the first- 
level cache. Unusual bit settings in the CADR, 
however, can cause the CVAX BIU to generate 
quadword cycles on reads without actually 
enabling the first-level cache.) Thus, the second- 
level cache will include the same kind of data as 
the first-level cache: I-stream only, D-stream 
only, or I- and D-stream references. 

Instead of waiting to determine whether a read 
reference hits in the cache, the memory controller
begins accessing memory in parallel with
the tag look-up in the second-level cache. If the
reference hits in the cache, the memory controller
will abort its response to CVAX (although
the control cycle to the memory modules completes
normally).

Like the first-level cache, the second-level
cache also writes directly through to memory.
The memory controller will perform a dump-and-run
write if the write is an unmasked longword.
Therefore, many write cycles can complete in two
microcycles. This completion time assumes the
memory modules are not busy completing a previous
dump-and-run write, aborted read cycle, or
refresh cycle.

During DMA, the second-level cache will also
check and invalidate one or two blocks as necessary.
These checks prevent stale data from accumulating
during DMA write cycles to memory.

Design of the
KA650 Second-level Cache

The importance of minimizing time taken to 
deliver the product to market made simplicity a 
high priority. For most major design decisions, 
we chose the simplest implementation. 

Cache Speed 

The cache speed was determined by the fastest
CVAX bus cycle. CVAX can read or write a single
longword in two microcycles (180 ns) and read a
naturally aligned quadword in three microcycles
(270 ns). Each added wait state costs another
microcycle (90 ns). For example, a typical quadword
read from main memory requires five
microcycles for the first longword and three
microcycles for the second longword, a total of
720 ns. Therefore, the goal of the second-level
cache was to allow CVAX to execute from memory
with no wait states. Preliminary timing diagrams
determined that to keep up with a 100-ns
CVAX, the cache would require 45-ns static RAMs.
When later in the project KA650 module designers
changed the clock speed from 100 ns to
90 ns, they also replaced the 45-ns cache RAMs
with 35-ns RAMs.
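
The timing figures in this section follow from the 90-ns microcycle, as the short Python calculation below shows. The function name is hypothetical; the split of the fastest quadword read into two microcycles for the first longword and one for the second is inferred from the longword and quadword times given above.

```python
MICROCYCLE_NS = 90


def quadword_read_ns(first_longword_cycles, second_longword_cycles):
    """Time for a quadword read, given microcycles per longword."""
    return (first_longword_cycles + second_longword_cycles) * MICROCYCLE_NS


fastest_ns = quadword_read_ns(2, 1)         # no wait states
main_memory_ns = quadword_read_ns(5, 3)     # typical read from main memory
wait_states = (main_memory_ns - fastest_ns) // MICROCYCLE_NS
```

The difference between the two cases is the number of wait states that a second-level cache hit removes from a quadword read.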

Cache Size 

Increasing a cache's size always improves its
performance. Since high performance was a
major priority, choosing the cache size was simply
a matter of finding the largest RAM that would
run fast enough, fit on the board, and not risk the
schedule. At the beginning of the project, we
doubted that 256-kilobit (Kb) static RAMs with
45-ns access time would be available soon
enough. However, we expected 64Kb RAMs to be
mature when Manufacturing would need production
volumes of the parts for the MicroVAX 3500/3600
system.

The 64Kb RAMs were available in three 
organizations: 64K by 1, 16K by 4, and 8K by 8. We 
could have arranged these to form a 256KB cache 
(using 32 64K-by-1 RAMs), a 64KB cache (using 
8 16K-by-4 RAMs), or a 32KB cache (using 4 
8K-by-8 RAMs). The 256KB cache would not have 
even fit on the module, and so was not 
considered. The 64KB cache would fit (requiring only 
slightly more module space than the 32KB 
cache) and was actually cheaper than the 32KB 
cache. So naturally we chose the 64KB cache. We 
then added four 16K-by-1 RAMs for byte parity. 
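
The chip counts and cache sizes above follow from spanning a 32-bit data path with 64Kb parts. A minimal sketch of that arithmetic, assuming one rank of RAMs across the 32-bit bus (the names below are illustrative and not from the design):

```python
KBIT = 1024
BUS_WIDTH_BITS = 32

# (words per chip, bits per word) for each 64-kilobit organization
ORGANIZATIONS = {
    "64K-by-1": (64 * KBIT, 1),
    "16K-by-4": (16 * KBIT, 4),
    "8K-by-8": (8 * KBIT, 8),
}


def cache_option(words, bits):
    """Return (chips needed to span the bus, resulting data store size in KB)."""
    chips = BUS_WIDTH_BITS // bits
    size_kb = words * BUS_WIDTH_BITS // 8 // 1024
    return chips, size_kb


# One parity bit per byte of a longword gives the four parity RAMs.
PARITY_CHIPS = BUS_WIDTH_BITS // 8

if __name__ == "__main__":
    for name, (words, bits) in ORGANIZATIONS.items():
        chips, size_kb = cache_option(words, bits)
        print(f"{name}: {chips} chips, {size_kb}KB")
```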

Cache Organization 

We quickly ruled out organizing the cache with 
more than one set. More than one set would 
either require too much logic or run too slowly. 
To get data fast enough from the correct set on 
a read hit would require a multiplexer and a 
separate set of RAMs for each set. This additional 
logic would take more space than we had 
available. Another possibility was to use a "select set" 
signal generated from the tag-store match signals 
as an address bit into the data store RAMs. This 
organization, however, would run too slowly. 

The cache performance simulation data avail- 
able to us assumed the cache was flushed on 
every context switch. We felt this assumption 
might be overly pessimistic for caches as large as 
64KB. Furthermore, we expected that more real- 
istic data would not show a large performance 
advantage for a two-way set associative cache 
over a direct-mapped cache. We therefore chose 
the simpler direct-mapped organization. 

Block Size 

When choosing the block size for the 
second-level cache, we again decided in favor of 
simplicity. We chose to make the second-level cache use 
the same size block as the first-level, which was 
already set at a quadword. At quadword block 
size, the second-level cache can allocate a block 
simultaneously with the first-level cache. The 
second-level cache simply captures the data from 
the quadword read as it comes from memory over 
the CDAL bus. 

We had several additional reasons for not 
choosing either a longword block size or a size 
larger than a quadword. Use of a longword block 
size in the second-level cache complicates the 
control logic and potentially degrades perfor- 
mance. To respond to a CVAX quadword read, the 
cache would require two separate tag look-ups. If 
the first look-up hit but the second look-up 
missed, the cache would have to retry the bus 
cycle. The retry would invalidate the block in the 
first-level cache and waste bus bandwidth. On the 
other hand, use of a block size larger than a quad- 
word would require extra data path and control 
to perform block fill operations. 

Tag Store Organization 
Once we knew the data store size (64KB), orga- 
nization (direct-mapped), and block size (quad- 
word), we could determine the organization of 
the tag store. 

The tag store requires one row for each of the 
8,192 quadword blocks of the data store. Of the 
CVAX 30-bit physical address, 13 address bits 
(bits 15 through 3) are used to select the quad- 
word block and associated tag store row. Each tag 
row must store a parity bit, a valid bit, and 
enough of the memory address to specify where 
in main memory the quadword block of data 
came from. Since the KA650 would architec- 
turally support no more than 64MB, address 
bits 29 through 26 would always be zero to 
access main memory. This left 10 address bits 
(bits 25 through 16) to be stored in the tag 
row. Therefore the tag store would require 
8,192 words of RAM, each word consisting of 
10 tag bits plus a valid bit and a parity bit. 
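
The address split described above can be sketched as follows. The bit positions come from the text; the function and constant names are illustrative only.

```python
OFFSET_BITS = 3    # bits 2..0: byte within the quadword block
INDEX_BITS = 13    # bits 15..3: one of 8,192 blocks and tag-store rows
TAG_BITS = 10      # bits 25..16: stored in the tag row


def split_address(addr):
    """Split a main memory address (below 64MB) into (tag, index, offset)."""
    if not 0 <= addr < (1 << 26):
        raise ValueError("main memory addresses have bits 29..26 clear")
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = (addr >> (OFFSET_BITS + INDEX_BITS)) & ((1 << TAG_BITS) - 1)
    return tag, index, offset


if __name__ == "__main__":
    tag, index, offset = split_address(0x01234568)
    print(f"tag={tag:#x} index={index:#x} offset={offset}")
```

Two addresses that differ only in bits 25 through 16 select the same row, which is why a direct-mapped cache must compare the stored tag before declaring a hit.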

To make this 8K-by-12 array, we used three of 
the same 16K-by-4 RAMs used in the data store. 

We did examine the special 2K-by-9 tag-store 
RAMs being developed by some memory vendors. 
We concluded that these RAMs were too small 
and their availability too risky for the KA650. 

Look-aside Architecture 
The design of the first-level cache keeps most 
of the processor memory traffic off the CDAL 
bus. Instead of this "look-through" design, the 
second-level cache uses a "look-aside" 
architecture, which simplifies the bus data path and 
control and improves performance on cache misses. 

In the look-aside architecture, both the 
second-level cache and the memory controller 
reside on the same bus. When CVAX starts a read 
cycle, the memory controller begins accessing 
main memory in parallel with the tag check in 
the second-level cache. If the cycle misses the 
cache, then main memory is prepared to respond 
as quickly as possible. If the cycle hits the cache, 
the memory controller senses the hit and aborts 
its response to the bus cycle. A drawback of this 
scheme is that the memory controller must still 
complete the control cycle to the dynamic RAMs 
of main memory. Consequently, the controller 
cannot respond as quickly as it had initially 
if the cache hit is immediately followed by a 
cache miss. We expected this penalty to be 
insignificant. 

The alternative to a look-aside architecture 
would be to place the memory controller on a 
separate bus. The bus cycle would pass to the 
controller only after the cycle missed the 
second-level cache. This design would have improved 
the efficiency of main memory usage. However, 
this design requires additional data path and con- 
trol to create the separate memory bus, and 
reduces processor performance by adding at least 
one additional microcycle to the penalty for a 
cache miss. 

Handling of Write Cycles 
We chose a simple write-through design for the 
second-level cache instead of a more complex 
write-back design. The penalty of not using 
write-back is reduced by the CMCTL dump- 
and-run write feature. When CVAX writes an 
unmasked longword to main memory, the CMCTL 
latches the address and data and terminates the 
bus cycle before the write to main memory is 
actually completed. If write cycles occur back to 
back (which is common for VAX processors), 
then the second write will be delayed while 
the first one completes. However, many write 
cycles can still complete in the minimum two 
microcycles. 

DMA Access to the Cache 
To maintain design simplicity, we decided not to 
allow DMA to read or write the second-level 
cache. This section discusses several of the 
possibilities we considered and rejected. These 
include DMA reads, DMA write-through, and a 
cache without valid bits. 

First, we considered allowing the CQBIC 
(which is the only DMA device on the CDAL bus) 
to read from the second-level cache. However, 
the cache control logic is synchronous with the 
CVAX clocks. The control logic design would 
have been significantly complicated if that logic 
had to respond to the CQBIC, which runs 
asynchronous to the CVAX clocks. 

Second, the cache must recognize DMA writes 
to memory to prevent stale data from accumulat- 
ing in the cache. We considered letting DMA 
write cycles write through the cache, but again 
concluded the timing was too complex to be 
practical. (Bus parity was also a concern, which 
is discussed in the section Cache Parity.) Instead, 
the second-level cache latches the address and 
simply invalidates one or two blocks if the 
address hits in the cache. 

Finally, while considering DMA write-through, 
we thought about designing the cache without 
any valid bits. Power-up routines in the read-only 
memory (ROM) code could initialize the cache 
to match main memory. The cache would then 
remain consistent with memory unless an uncor- 
rectable ECC (error correcting code) error was 
encountered in main memory. When that error 
occurred, the cache would simply disable read 
hits until the operating system could restore con- 
sistency with main memory by writing the quad- 
word block containing the error. Of course once 
we decided against DMA write-through, we had 
to include valid bits. 

Cache Parity 

To improve the integrity of the second-level 
cache, both the data store and the tag store of the 
second-level cache are protected by parity. 

Data Store Parity — Data store parity was sim- 
plified by taking advantage of the CDAL bus par- 
ity supported by CVAX and CMCTL. The data 
store simply stores and returns parity captured 
off the bus, and asserts CDPE (CVAX data parity 
enable) to have CVAX check the parity. 

This parity checking scheme was another rea- 
son we rejected DMA write-through, since CQBIC 
neither generates nor checks CDAL bus parity. 

One drawback to this simple scheme is that the 
processor cannot easily determine the source of a 
CDAL bus parity error. A CDAL bus parity error 
can be caused by a cache failure, a CMCTL fail- 
ure, or an actual bus fault (such as open etch). 
This lack of isolation makes error diagnosis 
difficult or impossible when CVAX detects a 
CDAL bus parity error. 

One useful feature we did not think to include 
was a control register bit to disable the assertion 
of CDPE and the subsequent parity checking by 
CVAX. Such a bit would allow a machine check 
handler to isolate a failing bit in the data store. 
Without this control register bit, software can at 
best determine in which byte the error resides; if 
multiple bytes have errors, only one byte can be 
identified. 

Tag Store Parity — The tag store parity must be 
generated and checked by the tag store itself. 
Two separate parity trees are used: 

■ The predictive parity tree 

■ The error-checking tree 

The predictive parity tree generates the parity 
of the tag field of the address. This tree predicts 
what the parity stored in the RAM must be for the 
bus cycle to hit in the cache. Predictive parity is 
fast because the parity is calculated while the tag 
RAMs are looking up the tag. This scheme does 
not delay the tag comparison and is sufficient to 
guarantee that bad parity stored in the tag RAMs 
will force a cache miss. However, it is not 
sufficient to determine whether the parity in the 
RAMs is actually bad. Thus, a second parity tree, 
the error-checking tree, is needed. 

The error-checking tree identifies bad parity in 
the cache tag RAMs. The output of this second 
tree is checked after the hit/miss decision is 
made, to determine whether a miss was caused 
by bad parity. If bad parity is detected, the cache 
control register error bit is set, the cache-enable 
bit is cleared, and an interrupt is posted to the 
processor. Since the bad parity forced a miss, no 
state is corrupted, and a process or system crash 
is averted. 

Second-level cache tag parity covers both the 
10 tag bits and the valid bit to protect against 
erroneously set valid bits. 
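
The interplay of the two parity trees can be modeled in a few lines. This is a behavioral sketch only: the parity polarity, the row layout, and all names are assumptions made for illustration, and it models one tag-store row rather than the hardware timing.

```python
def parity(value):
    """Parity bit over value: 1 if an odd number of bits are set (polarity assumed)."""
    return bin(value).count("1") & 1


def make_row(tag, valid=1):
    """Build a tag-store row whose parity covers the 10 tag bits and the valid bit."""
    return {"tag": tag, "valid": valid, "parity": parity((tag << 1) | valid)}


def lookup(row, addr_tag):
    """Return (hit, parity_error) for one tag-store row."""
    # Predictive tree: the parity the row must hold for this address to hit.
    # It depends only on the address, so it can run while the RAM is read.
    predicted = parity((addr_tag << 1) | 1)
    hit = (row["tag"] == addr_tag and row["valid"] == 1
           and row["parity"] == predicted)
    # Error-checking tree: parity recomputed from what the RAM returned,
    # examined only after the hit/miss decision has been made.
    stored_ok = parity((row["tag"] << 1) | row["valid"]) == row["parity"]
    return hit, (not hit) and (not stored_ok)


if __name__ == "__main__":
    good = make_row(0x155)
    flipped = dict(good, tag=0x154)     # single-bit error in the stored tag
    print(lookup(good, 0x155))          # clean hit
    print(lookup(flipped, 0x154))       # forced miss, parity error reported
```

In this model any single-bit error in the tag, valid, or parity bit makes the stored parity disagree with the predicted parity, so the lookup misses before the slower check reports the error.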

Cache Diagnostic Space 
Early in the project we recognized the value 
of being able to directly access the cache as 
64KB of fast RAM. Thus we created "cache diag- 
nostic space" in the 64MB address range from 
1000 0000 to 13FF FFFF. In cache diagnostic 
space, the cache RAM appears as 1,024 copies of 
the 64KB of cache. The cache responds to all 
CVAX read and write cycles in this address range, 
effectively forcing a cache hit. For simplicity, 
DMA access to cache diagnostic space is not per- 
mitted. 
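
A short sketch of how an address in cache diagnostic space relates to the cache RAM. It assumes the 1,024 copies are laid out back to back, so that the low 16 address bits select the byte within the cache; the function name is hypothetical.

```python
DIAG_BASE = 0x10000000
DIAG_LIMIT = 0x13FFFFFF
CACHE_BYTES = 64 * 1024


def diag_location(addr):
    """Map a cache diagnostic space address to (copy number, byte offset in cache RAM)."""
    if not DIAG_BASE <= addr <= DIAG_LIMIT:
        raise ValueError("address is outside cache diagnostic space")
    return divmod(addr - DIAG_BASE, CACHE_BYTES)


if __name__ == "__main__":
    copies = (DIAG_LIMIT - DIAG_BASE + 1) // CACHE_BYTES
    print(f"{copies} copies of the cache RAM")
    print(diag_location(0x10010004))
```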

During power-up self-test, some diagnostics 
are relocated from the boot/diagnostic ROM 
to cache diagnostic space for faster execution. 
Cache diagnostic space was also useful at initial 
debug of the CVAX chip set. We were able to 
downline load diagnostic programs through the 
console serial line and execute them from the 
cache diagnostic space. With the diagnostic pro- 
grams in this cache space, we could continue 
debug work on the module without relying on 
either main memory or the Q-bus interface. 

Writing to cache diagnostic space could cor- 
rupt normal cache operation by creating stale 
data in the cache. To prevent this, write cycles to 
cache diagnostic space normally invalidate the 
tag for that address. This invalidation also pro- 
vides a simple means for flushing all or part of the 
cache. To simplify diagnosis of cache faults, a 
diagnostic mode bit in the cache control register 
can be set to cause writes to cache diagnostic 
space to set the valid bit instead of clearing it. 
Setting the diagnostic mode bit also clears the 
cache enable bit. Thus normal allocation and 
DMA invalidation are prevented from acciden- 
tally upsetting a diagnostic pattern being written 
into the cache. These features simplify the task of 
putting the cache in a specific state for diagnostic 
purposes. 

Performance Measurements 

Measurements of second-level cache performance 
bear out that the fundamental architectural 
decisions were sound. 

The measurements were performed on a small 
system consisting of a KA650 CPU with 16MB of 
main memory, an RQDX3 disk controller with an 
RD54 hard disk, and a DEQNA Ethernet interface. 
The CPU module was modified with additional 
circuitry to detect various kinds of cacheable bus 
cycles. The system ran VMS version 4.7A. To 
heavily load the system with reasonably realistic 
workloads, we used varying combinations of 
three basic tasks. 

■ Assembling and linking a large program writ- 
ten in VAX MACRO 

■ Running a CAD program that compares the 
topology of two large net lists 

■ Copying large files (greater than 8,000 
blocks) across the network 

Four 16-bit counters and a logic analyzer were 
used to log the occurrence of particular bus 
cycles. For each measurement, the cache 
performance was monitored continuously for 5 to 30 
minutes (depending on the workload and type of 
bus cycle) to collect a total of 268 million 
sequential bus cycles of interest. For example, to 
study the read hit rate, the four counters 
simultaneously collected: 

■ The total number of cacheable quadword read 
cycles 

■ The number of cacheable quadword read 
cycles that hit in the second-level cache 

■ The number of cacheable quadword read 
cycles that missed the second-level cache 
(Counting both the cache hits and misses pro- 
vides a useful error check.) 

■ The number of cacheable quadword read 
cycles that hit in the cache, or that would have 
hit if the valid bit had been set 

Since CVAX gives no external indication when 
a memory read is satisfied by the internal cache, 
only reads that miss the first-level cache (and 
therefore generate a bus cycle) can be directly 
measured. Thus, it is important to note that 
the read hit rate of the second-level cache alone is 
not the same as the read hit rate of both caches 
taken together (which is beyond the scope of 
this paper). This is not a problem for 
write cycles because the first-level cache is 
write-through. 

Test Results 

For the workloads tested, the read hit rate was 
typically 85 percent and ranged between 82 per- 
cent and 91 percent. This is what we intuitively 
expected: the large size of the cache would keep 
the hit rate high, even though the first-level cache 
tends to strip off much of the memory access 
locality. 

We measured the read hit rate of the 
second-level cache with the first-level cache turned off, 
just to get an idea of how well a simple but large 
cache can perform. The memory read hit rate 
ranged between 96 percent and 99 percent when 
the first-level cache was turned off. This 
demonstrates that even a simple direct-mapped cache 
performs well if it is large enough. However, note 
that turning on the first-level cache tends to 
radically alter the bus traffic seen by the second-level 
cache. Therefore a direct comparison between 
hit rates with and without the first-level cache 
can be misleading. 

The "would have" hit rate is a measure of what 
the read hit rate would have been if DMA write 
cycles wrote through the cache instead of 
invalidating the cache. The modifications to the CPU 
module included an extra tag comparator that 
ignores the valid bit. Once the cache has initially 
filled, valid bits are cleared only by DMA 
invalidates. If the tag matches but the valid bit is 
cleared, then the cache miss was caused by a 
DMA invalidate and would have been a hit if the 
DMA cycle had written through. 
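
The four counter values described earlier reduce to two hit rates and a consistency check. The sketch below shows one way to do that reduction; the function name is hypothetical, and the numbers in the usage lines are invented purely to exercise the code. They are not measurements from the project.

```python
def read_hit_rates(total, hits, misses, tag_matches):
    """Derive (hit rate, "would have" hit rate, difference) from four counters.

    tag_matches counts reads whose tag matched whether or not the valid
    bit was set, as seen by the extra comparator that ignores the valid bit.
    """
    if total <= 0:
        raise ValueError("no bus cycles were counted")
    if hits + misses != total:
        raise ValueError("hit and miss counters do not add up to the total")
    if not hits <= tag_matches <= total:
        raise ValueError("tag-match counter is inconsistent with the others")
    hit_rate = hits / total
    would_have = tag_matches / total
    return hit_rate, would_have, would_have - hit_rate


if __name__ == "__main__":
    # Hypothetical counter values, for illustration only.
    rate, would_have, gain = read_hit_rates(1000, 850, 150, 851)
    print(f"hit {rate:.1%}, would-have {would_have:.1%}, gain {gain:.1%}")
```

Counting hits and misses separately costs one counter but catches a miswired or overflowing counter immediately, which is the error check the text mentions.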

The "would have" hit rate showed the benefit 
of DMA write-through would have been 
negligible. The incremental improvement in hit rate was 
typically 0.1 percent, though in one case it rose 
to about 1.3 percent (copying large files over the 
network, with no other computational tasks). 
This improvement is lost in the noise when 
compared to the normal task-to-task variation in hit 
rate. Again, this is what we intuitively expected: 
DMA tends not to write into memory currently in 
use by the processor. Clearly we made the right 
decision to avoid the added complexity of DMA 
write-through. 

Memory write cycles were also measured for 
the same tasks as memory reads. However, instead 
of measuring the "would have" hit rate, we 
counted the number of cycles that took longer 
than two microcycles to complete. This gives us 
some measure of the effectiveness of the CMCTL 
dump-and-run write buffer. 

The memory write hit rate ranged between 
77 percent and 89 percent. Of all memory write 
cycles, 46 percent to 63 percent took longer than 
two microcycles (the minimum write cycle 
time); and 37 percent to 44 percent took longer 
than two microcycles and hit in the cache. 

We had hoped more cycles could take 
advantage of the dump-and-run write buffer in the 
CMCTL. However, this performance is still good 
for the relative simplicity of the CMCTL write 
buffer. Also remember that the CVAX internal 
write buffer helps shield CPU performance from 
the delays of many write cycles. The complexity 
and schedule risk of adding another write buffer 
or designing the cache for write-back operation 
would not have been justifiable. 

To examine the relative impact of the two-level 
cache on processor performance, we ran 
benchmarks with both caches enabled, each cache 
alone, and both caches turned off. Table 1 shows 
some typical results normalized to the 
performance of the KA650 with both caches turned on. 
Performance of the MicroVAX II is shown for 
comparison. 



Table 1 Comparison of Benchmark Results 
for First- and Second-level Caches 

Benchmark  Neither  Second-level  First-level  Both    MicroVAX II 
           Cache    Cache Only    Cache Only   Caches 

HANOI      0.45     0.70          1.00         1.00    0.42 
PRIME      0.68     0.81          0.97         1.00    0.24 
FFT45      0.52     0.69          0.91         1.00    0.28 
JACOBI     0.47     0.65          0.93         1.00    0.27 
CAE2       0.51     0.69          0.95         1.00    0.31 

Each cache provides a significant performance 
boost, but performance with the first-level cache 
alone is better than performance with the 
second-level cache alone. The faster cycle time and 
two-way associativity of the first-level cache 
outweigh the large size of the second-level cache. 
An extreme example of this is the Towers of 
Hanoi benchmark, where the performance of 
both caches together is no better than that of the 
first-level cache alone. 

Conclusions 

At the project close, we had met our fundamental 
goals. The MicroVAX 3500/3600 CPU is 
compatible with the MicroVAX II but delivers three 
times the performance — performance attribut- 
able in part to the two-level cache. And because 
we adhered to a simple design approach, the new 
system was ready to ship as soon as CVAX chip 
sets were available in production volumes. 

The CVAX 78034 Chip, a 32-bit Second-generation VAX Microprocessor 

The MicroVAX 78034 chip, also known as CVAX, is a second-generation 
single-chip VAX microprocessor. A primary project goal was to 
develop a chip with three times the performance of the first single-chip VAX 
processor, the MicroVAX 78032. Therefore, architecture and circuit design 
efforts were directed toward decreasing ticks per instruction (TPI) and 
machine cycle time. The designers reduced the TPI by 27 percent and 
achieved a 90-nanosecond (ns) cycle, a significant improvement 
over the 200-ns cycle time of the first-generation chip. Implemented in a 
2-micron CMOS process, the chip comprises six major functional units. 
These include the instruction queue, execution unit, memory management 
unit, bus interface unit, microsequencer and control store, and a unique 
on-chip cache. 



The CVAX 78034 CPU chip is a second-generation, 
single-chip VAX microprocessor. This chip 
is the CPU of the MicroVAX 3500 and 3600 
computer systems, which have approximately three 
times the performance of the MicroVAX II 
computer system.1,2 The VAX 6200 family of systems 
uses slightly faster 80-ns (speed-binned) CVAX 
CPU chips in a multiprocessor configuration. In 
this paper, we describe the CVAX chip and 
explain how the increase in performance was 
achieved. 

Project Goals 

The primary project goal was to develop a 
single-chip CPU that implemented the VAX architecture 
and delivered three times the performance of the 
MicroVAX 78032 CPU chip used in the MicroVAX 
II computer systems. Of the several elements in 
this goal, performance presented the greatest 
design challenge. 

The performance of a CPU is inversely 
proportional to the product of ticks per instruction 
(TPI)1 and the machine cycle time. TPI depends 
on the performance of the system architecture. 
The minimum machine cycle time depends on 
circuit speed and on how the architecture is 
implemented. In the CVAX chip, both the TPI 
and the machine cycle time were improved to 
meet the performance goal. 
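
As a rough check on this relationship, the figures quoted in the abstract (a 27 percent reduction in TPI, and a cycle time reduced from 200 ns to 90 ns) can be combined. The sketch below is only that arithmetic; the function name is illustrative, and it assumes performance scales exactly as the inverse of TPI times cycle time.

```python
def relative_performance(tpi_ratio, old_cycle_ns, new_cycle_ns):
    """Speedup when TPI is multiplied by tpi_ratio and the cycle time changes."""
    if tpi_ratio <= 0 or old_cycle_ns <= 0 or new_cycle_ns <= 0:
        raise ValueError("all arguments must be positive")
    return (1.0 / tpi_ratio) * (old_cycle_ns / new_cycle_ns)


speedup = relative_performance(tpi_ratio=1.0 - 0.27,
                               old_cycle_ns=200, new_cycle_ns=90)
print(f"estimated speedup: {speedup:.2f}x")
```

The result is close to the stated goal of three times the performance of the first-generation chip.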

Much effort went into reducing the TPI. By 
way of comparison, the MicroVAX II system, 
which is based upon the MicroVAX 78032 chip, 
performs at approximately 11.5 TPI, whereas 
the MicroVAX 3600 system, which uses the 
CVAX 78034 chip, performs at approximately 
8.4 TPI. The TPI was lowered mainly by reducing 
the average number of cycles required to access 
memory. This reduction in the number of cycles 
was achieved by the inclusion of the following 
architectural features in the system: 

■ A 1-kilobyte (KB), on-chip instruction and 
data stream cache, which is capable of a 
longword read each cycle 

■ A 64KB, second-level cache on the board, 
which is capable of a longword read or write 
in two cycles and a quadword read in three 
cycles 

■ A 28-entry translation buffer (TB), which 
achieves a high hit rate for virtual-to-physical 
address translation 

Table 1 CVAX Instruction Set Architecture 

Instruction Type                     Number 

Implemented Fully by CPU 
  Integer/logical                        89 
  Address                                 8 
  Bit field                               7 
  Control                                39 
  Procedure call                          3 
  Miscellaneous                          10 
  Queue                                   6 
  System support                         11 
  Character string                        8 
  Subtotal                              181 

Implemented by Floating Point Chip 
  F floating                             24 
  D floating                             23 
  G floating                             23 
  Subtotal                               70 

Implemented Partially by CPU 
  Character string                        3 
  Decimal                                16 
  Edit                                    1 
  CRC                                     1 
  Subtotal                               21 

Implemented Fully in Macrocode 
  H floating                             28 
  Octaword                                4 
  Subtotal                               32 

Total                                   304 



Other factors influencing the lower TPI are as 
follows: 

■ More efficient microcode was implemented for 
some instructions. In general, most complex 
instructions, such as CALLx, RET, PUSHR, 
POPR, and INSV, were coded for speed rather 
than for space. 

■ Six additional instructions were implemented 
in microcode. These instructions are CMPC3, 
CMPC5, LOCC, SKPC, SCANC, and SPANC. 

■ The instruction decode section decodes all 
specifiers instead of relying on the microcode 
to decode some specifiers. 

The machine cycle time reduction was 
determined in part by the technology chosen for 
fabrication. The first-generation chip, the MicroVAX 
78032 CPU, has a 200-ns cycle time and was 
implemented in a 3-micron NMOS process. In 
comparison, the CVAX 78034 CPU chip had a 
goal of a 90-ns cycle time and was implemented 
in a 2-micron CMOS process. However, only 
60 percent of the improvement in the CVAX 
cycle time results from the fabrication process. 
The remainder results from architectural and 
circuit innovations, which are described in the 
section Internal Organization. 

The section following presents an overview of 
the CVAX architecture. 

CVAX Architecture 

The CVAX 78034 CPU chip implements the VAX 
architecture, which has 16 general-purpose reg- 
isters, the processor status longword, and 18 mis- 
cellaneous privileged registers. All 304 VAX 
instructions are supported by the system. 4 The 
chip fully executes 181 instructions and pro- 
vides microcode operand parsing for 21 instruc- 
tions that are emulated with macrocode. The 
chip passes 70 F, D, and G floating point instruc- 
tions to a companion floating point chip. The 
remaining 32 instructions are fully emulated in 
macrocode. Table 1 summarizes the instruction 
set architecture. 

The chip memory management hardware and 
microcode provide a demand-paged virtual mem- 
ory environment. The virtual memory size is 
4 gigabytes, and the physical address space is 
1 gigabyte. 

External Interface 

The CVAX bus provides a flexible interconnect 
protocol between all CVAX family members. The 
primary data bus is 32 bits wide and is time mul- 
tiplexed to share addresses and data. Up to four 
longwords can be transferred with each address. 
Strobes provide timing information for syn- 
chronous and asynchronous devices. Direct mem- 
ory access (DMA) request and grant signals are 
used to control arbitration of the data and address 
line (DAL) bus between the CPU and peripheral 
chips. 

As shown in Figure 1, the CVAX 78034 CPU chip 
is a synchronous device on the CVAX bus. In addi- 
tion to supporting the CVAX bus protocol, eight 
dedicated pins support a floating point coproces- 
sor interface. These pins are time multiplexed 
between the CPU chip and the coprocessor chip 
to transfer control and status information. 

Figure 1 CVAX 78034 External Interface 

A clock chip generates pairs of 180-degree 
phase-shifted clock signals that are distributed to 
all synchronous MOS components in the system. 
The clock also generates auxiliary pairs of clocks 
that can be used by any non-MOS components in 
the external interface. Separation of the clocking 
for MOS and non-MOS elements provides better 
skew control for the critical MOS clock signals. 

Microarchitecture 

The CVAX 78034 CPU chip has some pipelining 
and is microprogrammed. The chip comprises six 
major functional units: 5 6 7 

■ Instruction decode and prefetch queue 
(I-Box) 



■ Execution unit (E-Box) 

■ Memory management unit (M-Box) 

■ Bus interface unit (BIU) 

■ Cache 

■ Microsequencer and control store 

The photomicrograph in Figure 2 and the block 
diagram in Figure 3 illustrate all functional units 
on the chip. 

Internal Organization 

This section describes the six major functional 
units of the chip. As noted earlier, the emphasis 
here is on those aspects of the design that 
enhanced the machine's performance. In addition, 
at the end of this section we discuss the design 
approach taken to build in chip testability. 

Figure 2 Photomicrograph of the CVAX CPU Chip 

Figure 3 CVAX Block Diagram 

Digital Technical Journal No. 7 August 1988 

The flow of information between all functional 
units on the chip is synchronized by four on-chip 
clock phases of nominally equal duration. All 
circuits were designed to function with the partial 
phase overlap or underlap that can result from 
external clock skew and variations in the fabrica- 
tion process. 

Instruction Decode and 
Prefetch Queue 

The instruction decode and prefetch queue, the 
I-Box, controls macroinstruction sequencing and 
instruction stream prefetching. 8 During a micro- 
cycle, the I-Box determines what the next micro- 
code dispatch will be, based on the instruction 
stream data and the current processor state. 

The CVAX I-Box is designed to generate the 
microcode dispatch address for every specifier 
flow. This design differs from the MicroVAX 
78032 CPU chip design; there, the I-Box provides the 
dispatch address for just the first two specifiers of 
a macroinstruction and relies upon the micro- 
code to generate the dispatch address for addi- 
tional specifier flows at a performance cost of one 
microcycle per specifier. 

Primary subsections of the CVAX 78034 I-Box 
include the instruction decode read-only memory 
(ROM), the dispatch programmable logic array 
(PLA), and the prefetch queue. 

The instruction decode ROM (IROM) contains 
the information about VAX macroinstructions 
that is required to parse the instruction stream. 
The IROM determines the number of specifiers 
for an instruction, the sizes of its operands, and a 
partial microaddress for the execution micro- 
code of the instruction. 

The dispatch PLA examines I-Box state, instruc- 
tion stream data, and other microprocessor states 
to predict the next hardware-supplied 
microaddress for the microsequencer. This PLA is 
self-timed and evaluates in slightly under one clock 
phase. 

The I-Box instruction prefetch queue operates 
in parallel with the instruction execution 
hardware on the chip. Whenever a longword in the 
instruction prefetch queue is empty, the I-Box 
issues a request to the M-Box to read the next 
aligned longword in the instruction stream. If the 
M-Box and BIU are not doing some other read or 
write operation, they will fetch the requested 
longword and send it to the instruction prefetch 
queue. 

When a microinstruction that loads the pro- 
gram counter register is detected, for example, 
during a branch instruction, the prefetch queue 
is flushed. A new instruction must then be 
fetched before the processor can proceed. 

Up to three prefetched longwords of instruc- 
tion stream data can be queued by the prefetch 
queue. In addition, the prefetch queue rotates 
the instructions to bring the opcode to the front 
and extracts in-line instruction stream data for 
use by the E-Box. 

Execution Unit (E-Box) 
The main functional blocks in the execution unit, 
the E-Box, are the register file, program counter 
(PC), constant generator, shifter, and arithmetic 
and logical unit (ALU). The data path has two 
precharged 32-bit read buses, called the A and B 
buses, and a static write bus, called the W bus. 
The functions performed by the E-Box during a 
cycle are determined by the current 
microinstruction and internal state. Following are 
descriptions of each of the main functional 
blocks. 

The register file contains 31 single-read-port/ 
single-write-port registers and 8 dual-read-port/ 
single-write-port registers. The register file is 
used in the data path where compact layout is 
especially important. Therefore, to save chip area 
the register file cell was designed using an NMOS 
pass gate rather than a full transmission gate. 

The 32-bit PC is located in the data path along 
with the program counter adder. This adder is 
used to increment the PC as macroinstructions 
are parsed. 

Literals can be introduced into the data path by 
conditionally discharging the precharged A or B 
bus lines. 

The shifter function is implemented as a data 
extractor rather than a full shifter, which would 
require more hardware. The extractor can extract 
32 contiguous bits from a 64-bit field. When the 
values on the input buses are identical, the high- 
order bits appear to wrap around to the low-order 
positions, thus mimicking a full shifter. 
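
The extractor's behavior can be sketched in software as follows. This models only the function described above, not the circuit; which bus supplies the high half of the 64-bit field is an assumption, and the names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def extract32(high, low, position):
    """Extract 32 contiguous bits starting at bit `position` of the field high:low."""
    if not 0 <= position <= 32:
        raise ValueError("position must be in the range 0..32")
    field = ((high & MASK32) << 32) | (low & MASK32)
    return (field >> position) & MASK32


def rotate_right32(value, count):
    """Identical inputs make the high-order bits wrap around, giving a rotate."""
    return extract32(value, value, count % 32)


if __name__ == "__main__":
    print(hex(rotate_right32(0x12345678, 8)))
    print(hex(extract32(0, 0x12345678, 8)))    # zero high half: logical right shift
    print(hex(extract32(0x1, 0, 32 - 4)))      # zero low half: left shift by 4
```

Shifts in either direction and rotates all fall out of the same extraction, depending only on what is placed on the two inputs.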

The shifter has the two 32-bit precharged read 
buses (the A and B buses) as inputs and a 32-bit 
output. The shifter is implemented using NMOS 
transistors. The control diagonals are run in 
polysilicon strapped by metal at both ends. 
Because the RC delay in asserting the control 
lines is long, the control lines are driven before 
the input data is valid. The inputs are then condi- 
tionally pulled low, discharging the outputs. 

The ALU in the data path is capable of addition, 
subtraction, and a variety of logic operations. The 
ALU also includes a 1-bit left/right shifter and 
additional logic to support multiply and divide 
operations. The ALU is implemented using a
carry-lookahead scheme with propagate and generate
logic.
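
As a rough illustration of propagate and generate logic, the following Python sketch adds two operands using the standard recurrence c[i+1] = g[i] OR (p[i] AND c[i]). It evaluates the recurrence bit by bit for clarity; a hardware lookahead network computes the same carries in parallel. The function name and interface are assumptions, not the chip's design.

```python
def carry_lookahead_add(a: int, b: int, carry_in: int = 0, width: int = 32):
    """Add two width-bit operands using generate/propagate signals.

    Returns (sum, carry_out). The loop evaluates the carry recurrence
    serially; lookahead hardware expands the same equations in parallel.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    generate = a & b        # bit i produces a carry regardless of carry-in
    propagate = a ^ b       # bit i passes an incoming carry along
    carries = 0
    carry = carry_in & 1
    for i in range(width):
        carries |= carry << i
        g = (generate >> i) & 1
        p = (propagate >> i) & 1
        carry = g | (p & carry)
    return (propagate ^ carries) & mask, carry


def subtract(a: int, b: int, width: int = 32):
    """Subtraction as a + NOT(b) + 1 on the same adder."""
    mask = (1 << width) - 1
    return carry_lookahead_add(a, ~b & mask, 1, width)
```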

The ability to read the register file, do an ALU 
or shift operation, and write the result back into 
the register file all in one cycle is important to 
the machine's performance. This critical path 
was alleviated by partially overlapping the regis- 
ter file write with the next register file read. The 
partial overlap introduces a race between the 
write and the read, but the circuit delay in assert- 
ing the read select line is sufficient to ensure that 
the race is always won without extending the 
cycle time. 

Memory Management Unit 
When memory management is enabled, the M-Box 
uses a fully associative translation look-aside 
buffer (TB) to translate virtual addresses to physi- 
cal memory addresses. The major design goal for 
the M-Box was to achieve a TB miss rate that was 
one third that of the MicroVAX 78032 CPU chip. 
Consequently, we increased the size of the TB
from 8 to 28 page table entries (PTEs). Furthermore,
we used a more efficient microcode routine
to reduce the number of cycles required to fetch
a PTE on a TB miss. A PTE is composed of the
higher order bits of the physical address, the
access protection field, and other memory management
information. In the MicroVAX 78032
CPU chip, a least-recently-used algorithm was
employed to replace the PTE on a TB miss. However,
the implementation of this algorithm
requires complex circuits and a large amount of
chip area as the TB size is increased. For this reason,
we implemented a simpler but almost
equally efficient not-last-used algorithm in the
CVAX 78034 CPU chip.
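
One way to picture a not-last-used policy is sketched below in Python. The text does not say how the victim entry is chosen, so the rotating pointer used here is purely an assumption; the only property taken from the text is that the entry used most recently is never the one replaced.

```python
class NotLastUsedTB:
    """Toy fully associative translation buffer with not-last-used replacement."""

    def __init__(self, entries: int = 28):
        self.tags = [None] * entries   # virtual page numbers
        self.ptes = [None] * entries   # cached page table entries
        self.last_used = None          # index of the most recently used entry
        self.next_victim = 0           # hypothetical rotating pointer

    def lookup(self, vpn):
        """Return the cached PTE for vpn, or None on a TB miss."""
        for i, tag in enumerate(self.tags):
            if tag is not None and tag == vpn:
                self.last_used = i
                return self.ptes[i]
        return None

    def fill(self, vpn, pte) -> int:
        """Install a PTE after a miss, never evicting the last-used entry."""
        victim = self.next_victim
        if victim == self.last_used:
            victim = (victim + 1) % len(self.tags)
        self.tags[victim] = vpn
        self.ptes[victim] = pte
        self.last_used = victim
        self.next_victim = (victim + 1) % len(self.tags)
        return victim
```

Only one index (the last-used entry) has to be remembered, which suggests why such a policy needs far less circuitry than tracking a full least-recently-used ordering.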

To realize a single-cycle cache read operation, 
both a virtual-to-physical address translation and 
a check of the access protection field of the 
PTE must occur in just two clock phases. However,
there is not enough time to check the
access protection field after the translation has
occurred. Therefore, all access protection fields
in the TB are simultaneously compared to the
current access type while the translation is in
progress. This scheme requires that the access
protection field be fully decoded before it is
stored in the TB.

In addition to interacting with the cache, the 
M-Box interfaces with the BIU and the I-Box. 
The M-Box contains three registers: the virtual 
address (VA) register, the virtual address prime 
(VAP) register, and the virtual instruction buffer 
address (VIBA) register. After a data read or 
write using VA or VAP, VAP is loaded with the 
most recently used address plus four. In this way, 
VAP can quickly generate sequential longword 
addresses. During a memory operation, the 
M-Box sends the address to the cache and BIU. 
The M-Box will forward data from the E-Box 
during the next cycle if the operation is a write, 
or capture data for the E-Box if the operation is 
a read. 

Whenever there is space available for a long- 
word in the I-Box prefetch queue, the I-Box 
requests instruction stream data. If the M-Box 
does not decode a memory read or write request 
from the current microinstruction, it services the 
instruction stream read request using the virtual 
address stored in the VIBA register. After a 
prefetch reference succeeds, the VIBA register is 
incremented by four in preparation for the next 
prefetch. 

Bus Interface Unit 

The bus interface unit, the BIU, controls external 
chip operations, internal cache access and 
refresh, and arbitration for the internal data and 
address bus. The BIU contains two state 
machines. 

■ The internal state machine controls the arbi- 
tration for the internal data and address bus 
(IDALs). 

■ The external state machine controls the arbitration
for the external pins and DALs.

The design goal was to achieve a single-cycle 
read operation for hits to the internal cache and a 
two-cycle write operation for an ideal memory 
subsystem. In addition, better system reliability 
is achieved by providing parity protection on all 
the external data transfers and internal cache 
read/write operations.

To accomplish a single-cycle read operation,
the two state machines were implemented as self- 
timed PLAs that require just one phase to evalu- 
ate. The separation of control operations between 
the two state machines allowed the PLAs to oper- 
ate in different phases. Read/write-related, inter- 
nal time-critical signals are generated by the 
internal state machine. This state machine evalu- 
ates first, stalls the CPU if necessary, controls the 
cache, and sets states for the external state 
machine. Time-critical external strobes are con- 
trolled by the external state machine. The exter- 
nal state machine operates next, controls the ter- 
mination of external operations, clears the 
internal state machine flags, and grants control of 
external buses and strobes to external devices. 
On a cache miss, the external state machine 
unconditionally drives the external read data to 
the M-Box or the I-Box, and a phase later the state
machine validates the data. This scheme made it 
possible to service the next microinstruction 
while the previous one was completing. 

The BIU also controls all memory transactions. 
A memory read operation is performed in one 
cycle if there is a hit in the internal cache and no 
cache parity error is detected. However, when a 
cache miss occurs during a read operation, a two- 
longword block in the cache is allocated to store 
the data, which must now be read from memory. 
The BIU stalls the CPU until the first longword of 
data is received. The BIU initiates the external 
read cycle, sending the address of the first long- 
word to the external memory system. When the 
first longword of data is received, the BIU sends it 
to the cache and E-Box or I-Box, and unstalls the 
CPU. The fetch of the second longword is overlapped
with other chip activity to minimize the
effective memory access time. The second long- 
word of data is written into the alternate long- 
word in the allocated quadword (two longword) 
cache block. The cache block is validated only if 
both longwords in the block are fetched success- 
fully. 

The BIU contains a longword write buffer 
which supports a dump-and-run write mecha- 
nism. Chip activity, including cache reads, can 
proceed in parallel while the BIU is waiting for 
the completion of a write operation. The BIU may 
have up to three different operations in progress 
at once: a write to memory, a read from memory, 
and an internal cache entry invalidation. Descrip- 
tions of these operations in the BIU follow. 

While a write to memory is awaiting completion,
the internal state machine can service read
requests. If the read reference misses the cache,
it is queued and serviced only after the write 
operation completes. This overlapping of read 
and write operations reduces the number of 
memory stall cycles, resulting in a lower TPI. 

To facilitate support for multiprocessor appli- 
cations and DMA activity, the BIU provides a pro- 
tocol for internal cache coherency. To activate 
this function, an external device first gains own- 
ership of the external address and data bus by 
means of the DMA request and grant protocols. 
The device then presents an address, qualified by 
certain strobes, to the processor. The processor 
latches the address and then performs a cache 
look-up. If a cache hit occurs, the matching 
cache entry will be invalidated. 

Eight pins are dedicated to the floating point 
interface. To optimize the operand transfer rate 
between the CVAX 78034 CPU and its floating 
point processor, both chips read the floating 
point operands from memory simultaneously. 

Cache 

The goals for the design of the internal cache 
were twofold: to reduce the memory access time 
to one microcycle for data that is resident in the 
cache; and to minimize the number of cache ref- 
erences that miss the cache. 

To achieve the one-microcycle access time, the 
internal cache is designed to perform the cache 
look-up in parallel with the translation buffer 
look-up. This scheme uses the 9 virtual address 
bits that do not change during the address transla- 
tion process to index into the array. Because the 
cache look-up and translation buffer look-up are 
performed in parallel, the data for the selected 
cache entry is ready when the translated address 
is being latched into the tag comparator. The 
cache tag is then compared to the translated 
address. If a match occurs, the data is driven onto
the IDAL before the end of the cycle.
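
The figure of 9 untranslated address bits is consistent with the cache geometry given in this section, as the short Python calculation below suggests. The 512-byte VAX page size is a known architectural constant; the split of the 9 bits into a set index and a byte offset is an inference from the stated cache size, associativity, and block size, not something the text spells out.

```python
PAGE_SIZE = 512     # bytes per VAX page
CACHE_SIZE = 1024   # bytes (1KB internal cache)
WAYS = 2            # two-way set associative
BLOCK_SIZE = 8      # bytes per cache block

SETS = CACHE_SIZE // (WAYS * BLOCK_SIZE)
OFFSET_BITS = BLOCK_SIZE.bit_length() - 1      # byte within a block
INDEX_BITS = SETS.bit_length() - 1             # set selection
PAGE_OFFSET_BITS = PAGE_SIZE.bit_length() - 1  # bits unchanged by translation


def split_address(virtual_address: int):
    """Return (set index, byte offset) using only untranslated bits."""
    offset = virtual_address & (BLOCK_SIZE - 1)
    index = (virtual_address >> OFFSET_BITS) & (SETS - 1)
    return index, offset
```

Because the index and offset together fit within the page offset, two virtual addresses that differ only in their page number select the same set, so the array can be read before the translation completes.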

To achieve our second goal — minimization of 
the number of cache misses — we used a two- 
way set associative cache with a block size of 
8 bytes. This two-way set associative cache was 
designed to meet both performance and chip size 
requirements. First, a random replacement algo- 
rithm was selected to reduce circuit complexity 
with a minimal impact on cache performance. 
With reference to chip size, we determined that 
a cache size of 1KB was the largest that could be 
used. In addition, the cache is designed so that 
it can be configured by software to act as an 
instruction-only cache or as an instruction and
data cache. The instruction-only option was provided
to simplify hardware in multiprocessor
systems where the designers do not want to deal 
with DMA invalidates. 

The cell chosen to implement the cache array
is a one-transistor (1T) dynamic RAM. The 1T
cell, illustrated in Figure 4, was chosen because
of its small area. A comparable array design with
either a four-transistor dynamic RAM or a six-transistor
static RAM cell would have required 2.4 to
3 times as much area. The storage capacitance of
the 1T cell is 110 femtofarads, resulting in a bit-line
to cell-capacitance ratio of 8 to 1. With a
folded bit-line structure and the use of a dummy
cell (which stores half the charge of the storage
cell), a voltage differential of 200 millivolts was
realized at the sense amplifiers. Because of the
dynamic nature of the 1T cell, a refresh counter,
composed of linear feedback shift registers, was
designed to control which row is refreshed during
idle cache cycles.

We designed byte parity into the cache to 
detect data corruption resulting from either soft
or hard errors. A study was done to determine the 
soft error rate of the cell. The soft error rate for 
the cache array was found to be 10 FITs, where
1 FIT is equal to 1 failure in one billion operating
hours. To protect against data corruption due
to minority carrier injection, the array is sur- 
rounded by a deep N-type implant ring. 

The CVAX CPU chip is the first microprocessor 
in the industry to include an on-chip dynamic
1T cell cache.

Control Store and Microsequencer 
The operations and interactions of the five func- 
tional blocks described so far are all controlled 
by microcode in the control store. The micro- 
sequencer supplies the microaddress to the 
control store. The control store contains
1,600 words of read-only memory. Each 41-bit
word is divided into a 28-bit field, which controls
the execution sections of the chip, and a 13-bit
field, which controls the microsequencer. Control
store access is achieved in less than three
clock phases.

The control store is organized into 200 rows of 
8 words each. H-shaped cells, 7 by 8 microns in 
size, are used to implement the array. 

A microaddress is supplied to the control store
by the microsequencer by means of the 11-bit
microaddress bus (bits 10 through 0). Eight of
these bits, 10 through 4 and 0, select one of the
200 rows. Selection of a row causes all eight
words to be driven onto the precharged bit lines,
which form the inputs of an 8-to-1 multiplexer.
The three remaining microaddress bits, 3
through 1, choose one of these eight microwords
to be driven onto the microinstruction bus. The
final value of bits 3 through 1 can be modified by
values on the microtest bus. This 3-bit bus conveys
state information from other sections of the
chip to the microsequencer. In this way, various
processor states may be polled to enable up to an
eight-way microcode branch.
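
The addressing scheme can be sketched in Python as follows. Two details are assumptions made only for illustration: the order in which bits 10 through 4 and bit 0 are packed into a row-select value, and the use of a bitwise OR to merge the microtest bus into bits 3 through 1 (the text says only that these bits "can be modified").

```python
ROWS = 200
WORDS_PER_ROW = 8
MICROADDRESS_BITS = 11


def split_microaddress(addr: int):
    """Split an 11-bit microaddress into (row select, word select)."""
    if not 0 <= addr < (1 << MICROADDRESS_BITS):
        raise ValueError("microaddress must fit in 11 bits")
    # Bits 10..4 and bit 0 select the row (packing order is hypothetical).
    row_select = (((addr >> 4) & 0x7F) << 1) | (addr & 1)
    # Bits 3..1 choose one of the eight words in the selected row.
    word_select = (addr >> 1) & 0b111
    return row_select, word_select


def branch_word_select(word_select: int, microtest: int) -> int:
    """Merge the 3-bit microtest bus into the word select (assumed OR)."""
    return (word_select | microtest) & 0b111
```

Under the OR assumption, a microword whose bits 3 through 1 are zero can reach any of the eight words in its row depending on the microtest value, which corresponds to the eight-way branch mentioned above.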

The primary function of the microsequencer is 
to supply microaddresses to the control store. 
The microsequencer selects a microaddress 
based on microcode control and external control 
from the testability logic. In addition to generat- 
ing microaddresses, the microsequencer receives 
exception request lines from other sections, pri- 
oritizes these requests, and generates base 
addresses for microcode exception service rou- 
tines. These base addresses can be modified by 
the section signaling the exception by means of 
the microtest bus. 

The microsequencer contains a last-in-first-out
(LIFO) queue of eight microaddress entries 
called the microstack. A latched copy of the 
microaddress bus is stored on the microstack 
when a microcode exception occurs. Once the 
exception has been serviced, this latched copy 
allows reexecution of the microinstruction that 
caused the exception. In the case of a microcode 
subroutine call, the current microaddress is 
incremented and stored on the microstack. This 
forms the address when returning from the sub- 
routine. 
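
A minimal Python sketch of the two microstack uses described above is given here. The class and method names are hypothetical, and the overflow behavior and the 11-bit wrap on increment are assumptions rather than documented chip behavior.

```python
class Microstack:
    """Toy model of an eight-entry last-in-first-out microaddress stack."""

    DEPTH = 8
    ADDRESS_MASK = 0x7FF  # 11-bit microaddress

    def __init__(self):
        self._entries = []

    def _push(self, value: int) -> None:
        if len(self._entries) == self.DEPTH:
            raise OverflowError("microstack full")  # assumed behavior
        self._entries.append(value & self.ADDRESS_MASK)

    def push_exception(self, microaddress: int) -> None:
        """Save the faulting microaddress so it can be re-executed."""
        self._push(microaddress)

    def push_call(self, microaddress: int) -> None:
        """Save the incremented microaddress as the return address."""
        self._push(microaddress + 1)

    def pop(self) -> int:
        """Return the most recently saved microaddress."""
        return self._entries.pop()
```

The difference between the two push operations is the point of the sketch: an exception return repeats the saved microinstruction, while a subroutine return resumes at the one that follows the call.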

Testability Issues 

As a complex microprocessor chip, the CVAX 
78034 CPU chip has some difficult testability 
issues. A large number of internal state bits and 
buses are not normally visible at the pins of the 
chip. Early in the design process, techniques 
such as level-sensitive scan design (LSSD) and 
built-in self-test were eliminated as possible 
testability strategies. Both of these strategies 
would have had a significant impact on chip area 
and performance. Instead, an ad hoc method of 
design for testability was developed. 9 

The design for testability strategy has two main
themes: (1) make maximum use of existing hardware
for test observability and controllability,
and (2) add special test hardware to those areas
of the chip where observability or controllability
would not otherwise be possible.

The chip already had some important features 
that could be exploited. 

■ The chip is controlled by the microcode con- 
tained in the control store. Thus, it is an obvi- 
ous candidate for controlling the chip when in 
test mode. 

■ Many of the internal registers are readable and 
writable from the internal buses. By transfer- 
ring this read and write data to the main bus 
that connects to the pins (the DALs), much of 
the internal state can be observed and
modified.

■ The interface for the floating point coproces- 
sor chip contains a mode that broadcasts a 
value from the internal cache or register file to 
the pins. This mode is also used during test for 
cache and register file observability. 

These features alone were not enough, however, 
and some specialized test hardware had to be 
added. 

■ To make use of the chip microcode in test 
mode, it is necessary to be able to externally 
choose the addresses of the microword to be 
executed. Thus, a test mode was added to the 
microsequencer. In this mode, the micro- 
sequencer ignores its normal choice for the 
microaddress and uses the value from a group 
of pins. 

■ The cache is difficult to test in its normal oper- 
ating mode. To overcome this, a special cache 
diagnostics mode was developed. 

■ Some special test microcode was added to 
allow more efficient testing of some areas. 

■ A few major internal buses were not observ- 
able. Dual mode linear feedback shift registers 
(LFSRs) were added to these buses: the output 
of the I-Box instruction decode ROM, the 
microinstruction bus, and the microtest bus. 

The cache refresh address counter is also 
implemented as an LFSR. 

The dual mode LFSRs allow the data bus to be 
captured and scanned out serially. Alternatively, 
the data can be compressed every cycle using the 
linear feedback technique. The outputs of the 
LFSRs are inputs to another LFSR that combines
the data to a single-bit output stream. In this
manner, all of the LFSRs may be observed at once. 
In addition, all of the LFSR outputs are fed into a 
multiplexer that allows any one of the registers to 
be observed. 
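
The two LFSR modes can be illustrated with the Python sketch below. The register width, the feedback tap positions, and all names are hypothetical; the chip's actual polynomials are not given in the text. Scan mode captures a bus value and shifts it out one bit at a time, while compress mode folds each cycle's bus value into a running signature.

```python
class DualModeLFSR:
    """Toy dual-mode register: serial scan-out or signature compression."""

    def __init__(self, width: int = 8, taps=(7, 5, 4, 3)):
        self.width = width
        self.taps = taps            # hypothetical feedback tap positions
        self.mask = (1 << width) - 1
        self.state = 0

    def capture(self, bus_value: int) -> None:
        """Scan mode, step 1: latch the bus into the register."""
        self.state = bus_value & self.mask

    def scan_out(self):
        """Scan mode, step 2: shift the contents out, low-order bit first."""
        bits = []
        for _ in range(self.width):
            bits.append(self.state & 1)
            self.state >>= 1
        return bits

    def compress(self, bus_value: int) -> int:
        """Compress mode: shift with feedback, then XOR in the bus value."""
        feedback = 0
        for tap in self.taps:
            feedback ^= (self.state >> tap) & 1
        shifted = ((self.state << 1) | feedback) & self.mask
        self.state = shifted ^ (bus_value & self.mask)
        return self.state


def signature(values) -> int:
    """Compress a sequence of per-cycle bus values into one signature."""
    lfsr = DualModeLFSR()
    for value in values:
        lfsr.compress(value)
    return lfsr.state
```

A tester would compare the final signature against the value produced by a known-good simulation, so many cycles of bus activity can be checked with a single comparison.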

The test logic requires only one dedicated test 
pin to select test mode and uses less than 2 per- 
cent of the chip area. Moreover, inclusion of this 
logic does not affect chip performance. When in 
test mode, 3 to 15 other pins are redefined for 
test functions. A 4-bit test-mode configuration
register selects which of the LFSRs is to be 
observed, whether the LFSRs will be in scan or 
compress mode, and whether or not test broad- 
cast mode is enabled. 

The Role of Simulation and Modeling 

Complexity was managed and detailed circuit 
behavior was predicted through the use of mod- 
els and simulation. During the design, the chip 
was modeled at five levels of abstraction. As the 
design progressed from concepts to implementa- 
tion, the level of abstraction was refined to reflect 
the increasing detail of the design. 

Choosing the Microarchitecture 
The performance model was the earliest and the
most abstract of all the models. The performance 
model was used to predict the machine's perfor- 
mance and to quantify the speed advantage of the 
various microarchitectural options under consid- 
eration. Written in PL/I, the performance model 
was driven by trace files. These files consisted of 
streams of opcodes and operand specifiers 
derived by running typical VAX applications programs.
The pseudo-microcode contained in the
model approximately modeled memory request 
patterns and microinstruction counts for each 
type of VAX instruction. As we had planned, the 
performance model did indeed help predict the 
machine's TPI. Moreover, the model also helped 
identify performance bottlenecks in the micro- 
architecture. 

As noted in the section Project Goals, perfor- 
mance is inversely proportional to the product of 
the TPI and cycle time. Specifically, the cycle 
time depends on the delay through the critical 
speed circuits. Therefore, to identify the critical 
circuits and determine the propagation delays 
through the circuits, we carried out cycle time 
feasibility studies. SPICE, a circuit-level simulator,
was used in these studies. With the chip die
size as a given requirement, we determined the
microarchitecture of the machine by selecting 
those features that minimize the product of TPI 
and the cycle time. 
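
The trade-off can be made concrete with a small Python calculation. The functions below simply encode the stated relationship, performance proportional to 1 / (TPI x cycle time); the TPI values in the usage note are placeholders chosen for illustration, not measured figures.

```python
def instructions_per_second(tpi: float, cycle_time_ns: float) -> float:
    """Instruction rate implied by a given TPI and cycle time."""
    return 1e9 / (tpi * cycle_time_ns)


def relative_performance(tpi_new: float, cycle_new_ns: float,
                         tpi_old: float, cycle_old_ns: float) -> float:
    """Speedup of a new design over an old one."""
    return (tpi_old * cycle_old_ns) / (tpi_new * cycle_new_ns)
```

For example, `relative_performance(10, 80, 10, 200)` returns 2.5: with TPI held constant, shortening the cycle from 200 ns to 80 ns alone gives a factor of 2.5, and any reduction in TPI multiplies that gain.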

Verification of the Microarchitecture 
Once the microarchitecture was defined, a 
detailed specification was written for each sec- 
tion of the chip. Next, an abstract behavioral 
model was written to verify that the specification 
described a VAX CPU. Much more detailed than 
the performance model, this model was con- 
trolled by microcode, ran real VAX code, and 
closely modeled the major chip buses, global 
signals, and clocks. The model was written in 
Digital's DECSIM behavioral modeling language. 
Many microcode and microarchitecture bugs 
were identified and fixed as a result of this behav- 
ioral model testing. 

Logic and Circuit Design 
The detailed logic and circuit design began while 
the abstract behavioral model was being written.
During this phase of the design, SPICE simula- 
tions were used extensively to predict circuit 
behavior. Because SPICE simulates transistor 
behavior in detail, it requires a large amount of 
computer resources. Consequently, only critical
circuits were simulated, and these were often simplified
to contain only the essential elements.
Circuit simulations typically involve tens of transistors
rather than hundreds or thousands.

Verification of Logic — Gate Level 
The abstract behavioral model had been used to 
verify the specification. Now it was necessary to 
verify the implementation of the specification. To 
make this verification, we wrote a schematic-level 
behavioral model that captured the logical and 
timing characteristics of every schematic. Almost 
every node was modeled explicitly. This essentially
gate-level model was also written in the
DECSIM language. The model identified many
logic and timing bugs, especially between sche- 
matics designed by different engineers. 

The schematic-level behavioral model was sub- 
jected to intensive verification because it offered 
a good compromise between implementation 
detail and simulation efficiency. This model of 
the CVAX 78034 CPU chip was used by the sys- 
tem designers in other design teams to model the 
interaction of the CPU with other chips in board 
designs. 



Verification of Logic — Transistor Level

The DECSIM simulation tool also supports MOS 
transistor level modeling. We used this tool as a 
switch-level simulator, that is, we modeled tran- 
sistors as open or closed switches. The model was 
automatically generated from the schematic data- 
base. 

This level of modeling reflected the true behav- 
ior of the schematics with greater subtlety than 
the schematic-level behavior model. However, 
this model was not nearly as computationally 
efficient as the behavioral model. 

DECSIM MOS modeling identified sequencing 
errors, charge sharing problems, sneak paths, and 
race conditions that the more abstract models had 
failed to detect. 

Physical Technology 

The CVAX 78034 CPU chip is implemented in a 
P-EPI, N-well CMOS (complementary metal- 
oxide-semiconductor) process developed in- 
house. The process has two layers of aluminum 
interconnect and a single layer of polysilicon. 
The critical process dimensions and chip charac- 
teristics are summarized in Table 2. 

The chip contains 180,000 transistor sites
with 134,000 actual transistors, and measures
9.7 mm by 9.4 mm. (See Figure 2.) It
is packaged in an 84-pin surface-mountable
ceramic chip carrier with 50-mil leads, uses a
single +5 volt supply, and has a worst-case
power dissipation of 1.5 watts.



Table 2  CVAX Chip Process

Fabrication Process

Fabrication process      CMOS
Gate oxide               300 Å
Substrate                N-well in P-EPI
Device types             N-channel enhancement MOSFET;
                         P-channel enhancement MOSFET

Interconnect Pitches (Line/Space Drawn)

Polysilicon              2 micron/2 micron
Metal 1                  4 micron/2 micron
Metal 2                  5 micron/2 micron
Contacts                 2 micron/2 micron



Figure 5  MicroVAX 3500/3600 and MicroVAX II System Benchmark Comparison



Summary 

The CVAX 78034 CPU chip met the project 
design goals. Depending on the benchmark or 
application program being run, the performance 
of the MicroVAX 3500/3600 systems is 2.6 to 
4.1 times that of the MicroVAX II computer. 
(Refer to Figure 5.) This performance increase 
was achieved by reducing both the TPI and the 
machine cycle time. 

The main factors influencing TPI are the
1KB on-chip cache; the 64KB on-board cache;
and the 28-entry virtual-to-physical address trans- 
lation buffer. The cycle time was reduced as a 
result of the advanced process technology chosen 
and the architectural and circuit innovations 
made by the design team.



Development of the
CVAX Floating Point Chip

The CVAX floating point accelerator (CFPA) chip is a CMOS floating point
coprocessor for the CVAX system. The purpose of the CFPA project was to 
provide gains in floating point performance equal to those of the CVAX 
CPU for integer performance. Combined with an aggressive schedule, the 
primary goals required the CFPA chip to perform at three times the level of 
the previous generation MicroVAX floating point unit (FPU) and to be 
complete two years after delivery of the MicroVAX II system. Designers
obtained a performance gain of only 25 percent through base technology 
improvements. Consequently, most gains are achieved through the use of a 
multiplier array, improved arithmetic algorithms, and a fast and efficient 
interface with the CPU. 



Functional Overview

The CFPA VLSI chip is the companion floating
point processor for the CVAX CPU. The chip's
hardware structures and algorithms provide high
overall system performance. In all, the chip executes
76 instructions.

The CFPA supports

■ Three VAX floating point data types: 
F_floating, D_floating, and G_floating 

■ Floating point calculations, which include a 
polynomial evaluation instruction 

■ Integer multiply and divide instructions 

■ Conversion between integer and floating point 
data types 

■ Complete detection of all exception condi- 
tions 

The CFPA operates synchronously with the 
CPU at speeds of 80 and 90 nanoseconds (ns) per 
cycle. Opcode, control, and status information is 
communicated between the coprocessor and the 
CVAX by means of a dedicated 8-bit bidirectional 
coprocessor bus. 

Table 1 lists the CFPA physical characteristics.

CFPA Project Goals 

The two main goals of the CFPA chip design project
were (1) to provide the CVAX system with an
improvement in floating point performance to
equal the central processor chip's expected performance
level for integer operations, and (2) to
adhere to the same development schedule set for
the CVAX CPU chip. Specifically, these goals
required instruction execution times to be three
times faster than the MicroVAX FPU on average.
Further, the schedule allowed little time to
achieve these significant performance gains; the
design would have to be completed only two
years after the MicroVAX II system design.



Table 1  CFPA Physical Characteristics

Number of transistors    65,000
Package                  68-pin surface-mountable chip carrier
                         with 50-mil lead spacing and heat sink
Die size                 7.3 mm x 9.1 mm
Power dissipation        1 W
Fabrication process      2 micron drawn, N-well,
                         dual aluminum CMOS

In order to improve computer performance, 
the clock frequency and/or the amount of work 
completed in a cycle must be increased. The 
CVAX CPU uses the improved speed characteris- 
tics and greater density of the CMOS process to 
reduce the clock cycle time from 200 ns in the 
MicroVAX II design to 80 or 90 ns. A pipelined 
architectural approach was necessary to achieve
this reduction. In particular, while the arithmetic
and logic unit (ALU) operates on one microin- 
struction, the register file is free to access data 
for the next microinstruction. This improvement 
allows more work to be completed in each micro- 
cycle and offers a reduction in the cycle time as 
well. 

The previous generation floating point design, 
used in the J-11 FPA as well as the MicroVAX II
and VAX 8200/8300 systems, already pipelined
register file access with ALU operations. This
pipelining was necessary to allow a 100-ns cycle
time — twice the frequency of the companion 
CPUs — in the ZMOS process technology. Since 
the pipelined register/ALU operation was already 
achieved, the improvement in cycle time for 
the CFPA is limited by the speed of the ALU and 
does not benefit from additional pipelining. The 
improved technology allowed for an ALU imple- 
mentation that provides a 20 percent decrease 
in cycle time, matching the CVAX microcycle.
Therefore, the necessary performance increases 
for the CFPA would not be created by scaling the 
cycle time. Instead, CFPA designers would make 
improvements in the amount of work done per 
microcycle and in the interface between the pro- 
cessor and the floating point chip. This interface 
is described in the following section. 

An overview of the chip's overall performance 
is presented in the section CFPA Performance at 
the end of this paper. 

Processor-to-bus Interface 

In addition to the CVAX system bus used to trans- 
fer floating point data, a dedicated 8-bit bidirec- 
tional coprocessor bus is used to communicate 
between the CVAX and the CFPA. An example of the
CFPA system configuration is shown in Figure 1.
The CFPA normally monitors the coprocessor bus
for opcode and operand information until it is
ready to drive a result back to the CVAX. After
decoding an opcode, the CFPA monitors control
signals on the bus that indicate the presence of an
operand. Operands may come from a CPU general
register, internal cache location, or from the
memory system. When operands are transferred
from CPU general registers or internal cache 
locations, the data is transmitted directly 
between the CVAX and the CFPA. Operands from 
external memory or cache locations are indicated 
on the coprocessor bus at the start of the external 
memory access. The CFPA then monitors the 
CVAX system bus and latches the returning data 
without CVAX intervention. 





[Figure 1: block diagram with blocks labeled CPU, FPU, CVAX BUS, MEMORY, and SECOND-LEVEL CACHE RAM 64KB]

Figure 1  CFPA Example System Configuration



After supplying operands to the CFPA, the 
CVAX relinquishes control of the coprocessor bus 
to receive the result status of the floating point 
operation. Control of the coprocessor bus, how- 
ever, does not imply control of the CVAX system 
bus. The CFPA ensures availability of the CVAX 
system bus by monitoring the direct memory 
access (DMA) grant signal from the CVAX. If a 
DMA has been granted, the floating point result 
status will be retransmitted until the DMA opera- 
tion is complete. Receipt of the floating point 
status while the DMA grant signal is deasserted 
guarantees availability of the CVAX system bus 
for the next cycle. Control of the coprocessor bus 
is returned to the CVAX after successfully driving 
floating point status. The CFPA drives the result 
data on the CVAX system bus one cycle later, 
completing the operation. 

Floating point instruction latency comprises 
overhead devoted to opcode, operand and result 
transfer, and actual computation, or execution 
time. Due solely to improvement in CVAX cycle 
time — from 200 ns in MicroVAX systems to 
80 or 90 ns in CVAX systems — overhead times 
are improved by factors of 2.5 or 2.2, respectively.
Designers achieved additional improvements
in the interface by reducing the actual
number of cycles required for these overhead 
transfers. As compared to the MicroVAX II sys- 
tem, the CVAX system requires fewer cycles to 
access and transmit register and internal cache 
operands located on the chip. Moreover, external 
cache and memory operands are input directly 
from the CVAX system bus as opposed to being 
fetched by the CPU and later retransmitted to the 
FPU as in the MicroVAX II system. The resulting 
interface improves performance by a factor of 
approximately 2.5 (90-ns cycle) to 2.8 (80-ns
cycle) over the MicroVAX II system.

Despite these improvements, more than half
the cycles required to execute a floating point 
instruction in the CVAX system can still be 
attributed to overhead costs. The possibility of 
pipelining macroinstructions — overlapping the 
operand fetches of the next instruction with exe- 
cution of the current instruction — as well as 
operand forwarding was studied. In such a system 
the effective instruction time is determined by 
the longer of the operand transfers or the actual 
floating point execution time. Instruction time is 
not determined by the additive effect of the inter- 
face and execution. The one-instruction macro 
pipeline interface was rejected due to the risk 
and complexity of the design. Moreover, perfor- 
mance goals had already been met and develop- 
ment time was at a premium. 
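
The difference between the implemented interface and the rejected macro pipeline comes down to a sum versus a maximum, as the Python sketch below illustrates. The cycle counts in the usage note are invented for the example and are not CFPA measurements.

```python
def additive_time(transfer_cycles: int, execute_cycles: int) -> int:
    """Instruction time when operand transfer and execution do not overlap."""
    return transfer_cycles + execute_cycles


def pipelined_time(transfer_cycles: int, execute_cycles: int) -> int:
    """Steady-state instruction time when the next instruction's operand
    transfers overlap the current instruction's execution."""
    return max(transfer_cycles, execute_cycles)
```

With a hypothetical 12 cycles of transfer overhead and 10 cycles of execution, the additive interface takes 22 cycles per instruction while the pipelined one would take 12. Because overhead accounts for more than half of the cycles, it would become the limiting term in the pipelined case.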

Algorithms 

Although the interface figures prominently in the 
achievement of overall performance targets, most 
of our design efforts were focused on the actual
execution unit. To maintain and even increase 
the benefits gained by the interface design 
improvements, we needed an equal or greater 
improvement in execution times. Since the most 
important instructions for a floating point unit 
are addition/subtraction, multiplication, and to a 
lesser extent division, designers set about opti- 
mizing these instructions. The remainder of 
instructions implemented by the CFPA also 
benefit from the shift, multiply, and divide opti- 
mizations and demonstrate performance gains 
relative to the MicroVAX II FPU as well. Finally, 
all instructions gain from microcode improve- 
ments in atypical case handling and from faster 
code entry and exit techniques. 

Multiplication 

Floating point multiplication consists of multipli- 
cation of the fractional, or mantissa, portions of 
the operands and the summation of the corre- 
sponding exponents. Many multiplication tech- 
niques have been developed and implemented to 
increase the speed of this frequently executed 
instruction. Perhaps the best technique for VLSI 
implementation at this time is the multiplier 
array. The array is particularly well suited for 
VLSI implementation due to the array's regularity 
of circuit connections which allow for a very 
compact and repeatable cell design. 

The process of multiplication involves a series 
of additions. It is possible to delay the carry propagation
necessary to complete these additions
until the final sum is formed through the use of
carry save adders. Multiplier arrays consist of 
rows of carry save adders which add in a new mul- 
tiple of the multiplicand at each row. The carry 
save adders produce a result, or partial product, 
consisting of two outputs, the carry and the sum; 
if added, the two outputs represent a single num- 
ber equivalent to the partial product at that 
step obtained using full propagation addition. By 
deferring the final summation of the sum and 
carry words, the comparatively time-consuming 
carry propagation addition need be performed 
only once to produce the result. 
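
A minimal software sketch of the carry save idea follows. It is a model of the arithmetic only, not of the CFPA circuit: each call to `carry_save_add` (a hypothetical helper name) plays the role of one row of adders, and the single `s + c` at the end stands in for the final carry propagate addition.

```python
def carry_save_add(sum_word, carry_word, addend):
    """One row of carry save adders: three words in, two words out.

    Each bit position is an independent full adder, so no carry ripples
    along the word. The invariant is that new_sum + new_carry equals
    sum_word + carry_word + addend.
    """
    new_sum = sum_word ^ carry_word ^ addend
    majority = (sum_word & carry_word) | (sum_word & addend) | (carry_word & addend)
    return new_sum, majority << 1


def multiply_carry_save(multiplicand, multiplier):
    """Unsigned multiply that defers all carry propagation to the last step."""
    s, c = 0, 0
    shift = 0
    while multiplier >> shift:
        if (multiplier >> shift) & 1:
            s, c = carry_save_add(s, c, multiplicand << shift)
        shift += 1
    return s + c  # the only carry propagate addition


print(multiply_carry_save(13, 11))
```

This sketch adds one multiple of the multiplicand per multiplier bit; combining it with Booth recoding reduces the number of rows needed.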

The only drawback to the multiplier array is 
the large percentage of chip area devoted to this 
one operation. Nevertheless, the magnitude of 
performance gain warrants the use of an array in 
any high-performance computation unit. 

Another common method used to improve the 
processing of multiplications involves multiple- 
bit Booth encoding. This method, which requires 
significantly less hardware, is aimed at reducing 
the number of partial products needed to be 
formed. The multiplier operand is encoded — or 
recoded — as a control pattern used to deter- 
mine a sequence of shift and add or subtract oper- 
ations on the multiplicand. Multiple bits of the 
multiplier can then be retired in a single opera- 
tion. This method of reducing the number of 
multiplication steps can be employed either with 
or without an array structure. 
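
The recoding step can be sketched as follows. This example uses the common radix-4 form of Booth recoding, which retires two multiplier bits per step; that it matches the exact encoding used in either FPU is an assumption, and the function names are illustrative.

```python
def booth_radix4_digits(multiplier, bits):
    """Recode an unsigned `bits`-bit multiplier into digits in {-2, ..., 2}.

    Digits are returned least significant first; digit i has weight 4**i.
    Each digit is formed from two multiplier bits plus the top bit of the
    previous pair.
    """
    digits = []
    prev = 0
    # The final iteration absorbs the top bit of an unsigned multiplier.
    for i in range(0, bits + 1, 2):
        b0 = (multiplier >> i) & 1
        b1 = (multiplier >> (i + 1)) & 1
        digits.append(prev + b0 - 2 * b1)
        prev = b1
    return digits


def booth_multiply(multiplicand, multiplier, bits):
    """Each digit selects a shift and an add or subtract of the multiplicand."""
    product = 0
    for i, digit in enumerate(booth_radix4_digits(multiplier, bits)):
        product += (digit * multiplicand) << (2 * i)
    return product


print(booth_radix4_digits(0b01101110, 8))
print(booth_multiply(25, 0b01101110, 8))
```

Only the multiples 0, ±1, and ±2 of the multiplicand are needed here, all of which are shifts and complements. A 3-bit recoding also needs the multiple 3, which requires an extra addition.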

The previous generation MicroVAX FPU exe- 
cutes multiplication using a fixed, 3-bit-per-cycle 
Booth algorithm without the use of a multiplier 
array. Single-precision multiplication requires 8 
cycles to compute 25 product bits; D_floating 
and G_floating double-precision formats require 
19 and 18 cycles to produce the necessary 57 or 
54 product bits. Additional cycles are needed to 
set up the multiply loop, calculate the initial par- 
tial product based upon the multiplier least- 
significant bit (LSB), and round and normalize 
the final product. 

The CFPA multiply algorithm takes advantage 
of the greater density and transistor count 
afforded by the CMOS process. The CFPA imple- 
ments a multiplier array, which consists of four 
rows of 65 carry save adders. The multiplicand 
select logic associated with each row of the array 
as well as the interconnect between the rows is 
configured to implement a 2-bit Booth encoding. 
As a result of this configuration, 8 product bits 
are completed per pass through the array. Single- 
precision multiplication requires three passes 
through the array, and double-precision requires
seven passes to complete. 

The array can be evaluated twice per cycle. 
Therefore, single-precision multiplication re- 
quires one and one-half cycles, and double-preci- 
sion D_floating and G_floating formats require 
three and one half cycles of processing in the 
array. Before running the array, one-half cycle is 
needed for set up and initial product calculation. 
After the multiplier array completes, a cycle is 
used to complete the full carry propagate add, 
which combines the final carry and sum outputs 
of the array. This cycle is followed by a normal- 
ization cycle during which valid status is 
returned to the CVAX. 
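
The cycle counts above can be tallied with a short sketch. The fraction widths (24 bits for F_floating, 56 for D_floating, 53 for G_floating, hidden bit included) come from the VAX data formats rather than from this paper, and the function name is illustrative.

```python
import math


def cfpa_multiply_cycles(fraction_bits, bits_per_pass=8, passes_per_cycle=2):
    """Execution cycles for a multiply, following the steps described above."""
    passes = math.ceil(fraction_bits / bits_per_pass)
    setup = 0.5                       # set up and initial product calculation
    array = passes / passes_per_cycle
    carry_propagate = 1               # final add of the carry and sum words
    normalize = 1                     # normalization, status returned
    return setup + array + carry_propagate + normalize


for name, width in [("F_floating", 24), ("D_floating", 56), ("G_floating", 53)]:
    print(name, cfpa_multiply_cycles(width))
```

Under these assumptions the totals come to 4 cycles for single precision and 6 for double precision.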

When we compare the MicroVAX II system to
the CFPA, the number of cycles required to complete
a MULF instruction has been reduced from
14 to 4 (a ratio of 3.9 to 1 at 90 ns, 4.4 to 1 at 80
ns); to complete MULD or MULG instructions,
the reduction is from 26 to 6 (4.8 to 1 at 90 ns,
5.4 to 1 at 80 ns). If we include operand transfers
and count each interface cycle of the MicroVAX II
system as equivalent to two CVAX cycles, however,
the reduction in the total number of cycles
for MULF is from 27 to 9 (3.3 to 1 at 90 ns, 3.8 to
1 at 80 ns); and for MULD, from 43 to 14 (3.4 to
1 at 90 ns, 3.8 to 1 at 80 ns) for register-mode
instructions. When operands are read from or 
written to memory, the overhead support per- 
centage becomes an even greater factor; and the 
impact of the actual CFPA multiplication speed is 
reduced. 
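
The quoted ratios combine a cycle-count reduction with a cycle-time reduction. The sketch below assumes the MicroVAX figures are counted in 100-ns cycles, the MicroVAX FPU cycle time given elsewhere in this paper; the function name is illustrative.

```python
def speedup(old_cycles, new_cycles, new_cycle_ns, old_cycle_ns=100):
    """Ratio of old to new latency, given cycle counts and cycle times."""
    return (old_cycles * old_cycle_ns) / (new_cycles * new_cycle_ns)


cases = [
    ("MULF execution", 14, 4),
    ("MULD/MULG execution", 26, 6),
    ("MULF with transfers", 27, 9),
    ("MULD with transfers", 43, 14),
]
for name, old, new in cases:
    print(f"{name}: {speedup(old, new, 90):.2f} at 90 ns, "
          f"{speedup(old, new, 80):.2f} at 80 ns")
```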

To further increase performance, we consid- 
ered an array of sufficient size to complete single- 
precision multiplication in a single pass and 
double-precision multiplication in two passes. 
However, such an array would require three 
times the chip area for a 2-bit algorithm. A 3-bit- 
per-row multiply would require 8 rows to com- 
plete single-precision multiplication in one pass 
and 9 or 10 rows to complete double-precision 
multiplication in two passes, as well as an adder 
to calculate the multiplicand factor of 3. Either of
these alternatives, if feasible, would save only
one cycle in single-precision (a reduction from 9
to 8, or 11 percent) and two cycles in double-
precision multiplication (14 to 12, or 14 percent).
In addition to the area requirements, the
circuit design difficulty and risk involved to 
implement a larger array were deemed much too 
great for the limited gains. We therefore chose to 
trade off these smaller gains in favor of a partial
array of 4 rows of 2-bit-per-row retirement requiring
only 1.3 mm of chip height. The result is
a three and one-half to four times gain in the over- 
all performance of multiplication. 

Addition/Subtraction

Floating point addition involves a series of steps. 

1. The exponents are subtracted to determine
the shift amount necessary to align the fractions.

2. The fraction operand with the smaller
exponent is shifted into alignment and
added or subtracted.

3. The result is shifted back to the normalized
form (1/2 ≤ result < 1.0). Normalization
shifting is accompanied by exponent adjustment.

4. The result is rounded and checked for
overflow or underflow conditions.
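
The four steps can be sketched in software for the simplest case, the addition of two positive operands. The representation (integer fractions with two guard bits), the constants, and the function names are all illustrative assumptions. Subtraction, with its left normalization shifts, and the overflow and underflow checks are omitted.

```python
FRAC_BITS = 24   # F_floating fraction width, hidden bit included
GUARD = 2        # extra low-order bits kept for rounding (illustrative choice)


def fp_add(exp_a, frac_a, exp_b, frac_b):
    """Add two positive numbers given as (exponent, fraction) pairs.

    A fraction is an integer in [2**23, 2**24), standing for a value in
    [1/2, 1); the number represented is fraction / 2**24 * 2**exponent.
    """
    # Step 1: subtract the exponents to find the alignment shift.
    shift = abs(exp_a - exp_b)
    # Step 2: align the fraction with the smaller exponent, then add.
    if exp_a >= exp_b:
        exp, larger, smaller = exp_a, frac_a, frac_b
    else:
        exp, larger, smaller = exp_b, frac_b, frac_a
    total = (larger << GUARD) + ((smaller << GUARD) >> shift)
    # Step 3: normalize. A same-sign add overflows by at most one bit.
    if total >> (FRAC_BITS + GUARD):
        total >>= 1
        exp += 1
    # Step 4: round by adding half an LSB; renormalize if the round carries out.
    total += 1 << (GUARD - 1)
    if total >> (FRAC_BITS + GUARD):
        total >>= 1
        exp += 1
    return exp, total >> GUARD


def to_float(exp, frac):
    """Convert the sketch's representation to a Python float for inspection."""
    return frac / (1 << FRAC_BITS) * 2.0 ** exp


print(to_float(*fp_add(1, 0x800000, -1, 0x800000)))  # 1.0 + 0.25
```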

Typically, the shifting operations and their con- 
trol consume large amounts of chip area and 
potentially a large portion of the total calculation 
time. An analysis of these operations was used to 
guide trade-offs in the design of the CFPA.' It was 
noted that although large shifts are sometimes 
necessary to compute the final result, their fre- 
quency of occurrence is very small. Furthermore, 
a small shifter, capable of covering the vast 
majority of cases in a single operation provides 
the benefit of a small control circuit that can be 
more easily optimized for speed. It was decided 
that the speed and area advantages gained by 
designing for the most frequently occurring cases 
provided the best solution under project constraints.

Specifically, a small shifter that is capable of 
left-four to right-seven bit shifts proved to have 
adequate range for most alignment and normal- 
ization shifts. In up to 80 percent of the cases, 
additional cycles are not needed for alignment
shifting. Larger alignment shifting utilizes the 
multiplier array for a shift capability of 16 bits 
per cycle. The array minimizes the worst-case 
shift time without requiring a large shifter. 
Although it rarely requires additional cycles, nor- 
malization shifting may cause a longer latency. 
Additional cycles, however, are not necessary for 
normalization in 93 percent of the cases. 

To reduce the shifter control complexity, a 
modified ALU calculates the absolute value of the 
exponent difference. The modified ALU does not
require additional calculation time to accom- 
plish this calculation. The absolute value result 
simplifies control logic to enable the alignment 
shifter to complete in the next clock phase. Only 
one additional generate term is needed to enable 
two carry chains executing simultaneously; one 
calculates A minus B, the other B minus A. The 
most significant bit (MSB) of the first carry chain 
determines the sign of the operation. To produce
the absolute value or positive result, the MSB of 
the first carry chain is used to select the final out- 
put from the two carry chains. In addition, the 
MSB is used to select the fraction requiring 
alignment. 
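
A behavioral sketch of the two-chain scheme is shown below. It models only the selection logic, not the shared carry hardware. The 13-bit width matches the exponent data path described later in this paper; the function name and return convention are illustrative.

```python
def abs_exponent_difference(a, b, bits=13):
    """Return (|a - b|, a_is_smaller) using two differences computed in parallel.

    Assumes |a - b| < 2**(bits - 1), so the MSB of a - b is a true sign bit.
    """
    mask = (1 << bits) - 1
    a_minus_b = (a - b) & mask            # first carry chain
    b_minus_a = (b - a) & mask            # second carry chain
    a_is_smaller = (a_minus_b >> (bits - 1)) & 1   # MSB of the first chain
    magnitude = b_minus_a if a_is_smaller else a_minus_b
    return magnitude, bool(a_is_smaller)


print(abs_exponent_difference(125, 130))
```

The second element of the result identifies the operand with the smaller exponent, which is the one whose fraction must be aligned.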

The CFPA completes addition or subtraction 
operations in three cycles for most cases. This 
minimum execution time is exceeded for only 
25 percent of all addition or subtraction opera- 
tions, almost all of which require only one addi- 
tional cycle. 

The major improvement over the MicroVAX II 
FPU in the addition/subtraction algorithm is the 
elimination of no operation cycles necessary for 
control evaluation preceding the alignment and 
normalization steps. The resultant reduction as 
compared to the MicroVAX II FPU is from eight 
cycles to three for both single- and double-preci- 
sion additions/subtractions in the actual floating 
point unit calculations (3 to 1 at 90 ns, 3.3 to
1 at 80 ns). 

The overall performance gain in equivalent 
cycles is 20 to 8 for single-precision (2.8 to 1 at
90 ns, 3.1 to 1 at 80 ns) and 26 to 11 for double-
precision addition/subtraction (2.6 to 1 at 90 ns,
3.0 to 1 at 80 ns).

Division 

Floating point division consists of division of 
the fraction or mantissa and subtraction of the 
exponents. Division presents a more intractable 
problem than multiplication when designing for 
high-speed performance. The difficulty arises 
due to the fact that the partial remainder at each 
step must be examined before the next operation 
can be determined. Various algorithms have been 
proposed to reduce the number of arithmetic 
steps, but no single solution seems to optimize 
both performance and size constraints. 

The CFPA uses a method of division that offers 
an improvement over single-bit division algo- 
rithms, which perform an arithmetic operation 
to produce a single quotient bit per step. The
method calls for shifting over, or normalizing,
multiple leading bits when the partial remainder 
is small. A partial remainder with multiple lead- 
ing ones indicates a small negative remainder, 
whereas leading zeros indicate a small positive 
remainder. Multiple quotient bits can be deter- 
mined for cycles in which the magnitude of the 
partial remainder is small. Shift operations 
replace arithmetic operations on unnormalized 
remainders, reducing the number of ALU cycles 
needed to develop the final quotient. This 
method of division is called normalizing, non- 
restoring division and is also used in the 
MicroVAX FPU. The difference between the two 
implementations is in the normalization shift 
range provided for partial remainder and quo- 
tient development. 
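
The idea of replacing arithmetic steps with shifts can be illustrated with a textbook radix-2 scheme of the same family (often called SRT division). This is a sketch under stated assumptions, not the CFPA algorithm: the selection thresholds, the cycle-counting model, and the function name are all illustrative, and the cycle counts it produces are not claimed to reproduce measured CFPA figures.

```python
def normalizing_divide(n, d, bits, shift_range):
    """Return (quotient, remainder, cycles) for (n << bits) divided by d.

    d must be normalized (top bit of a `bits`-wide word set) and 0 <= n < d.
    A cycle is modeled as up to `shift_range` one-bit shifts, ending early
    when an add or subtract of the divisor is needed.
    """
    assert d >> (bits - 1) == 1 and 0 <= n < d
    half = 1 << (bits - 1)      # one-half on the scale where d lies in [1/2, 1)
    r, q = n, 0
    cycles = run = 0
    for _ in range(bits):
        r <<= 1                 # shift the partial remainder left one place
        q <<= 1
        if r >= half:           # large positive remainder: subtract the divisor
            r, q, arithmetic = r - d, q + 1, True
        elif r < -half:         # large negative remainder: add the divisor
            r, q, arithmetic = r + d, q - 1, True
        else:                   # small remainder (leading zeros or ones): shift only
            arithmetic = False
        run += 1
        if arithmetic or run == shift_range:
            cycles += 1
            run = 0
    if run:
        cycles += 1
    if r < 0:                   # final correction for a negative remainder
        r, q = r + d, q - 1
    return q, r, cycles


q, r, cycles = normalizing_divide(100, 200, 8, 4)
print(q, r, cycles)
```

Because the quotient digits do not depend on the shift range, a wider range can only merge cycles, never add them. The data sensitivity described in the text shows up as variation in the cycle count from one operand pair to the next.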

Of course, this algorithm is quite data sensi- 
tive. A division that results in a partial remain- 
der of all ones or all zeros can be completed 
in a minimum amount of time; whereas, if a 
string of alternating ones and zeros is produced 
at each ALU operation, the process degener- 
ates to a one-bit-per-cycle pace. The observed 
average rate for an algorithm that allows 
unlimited shift range is 2.66 bits per cycle. 
Unfortunately, the shift range chosen implies a 
control structure directly between the shift 
and ALU operations. The time between these 
operations is critically important to the over- 
all cycle of the chip. We chose 4 bits as the 
left shift range for the CFPA to reap the maxi- 
mum benefit from the technique without intro- 
ducing inordinately difficult control paths 
between the shift and ALU operations. This 
amounts to an increase of 2 bits of shift range 
over the MicroVAX FPU. Correspondingly, the 
average number of quotient bits developed 
each cycle increased from 1.5 to 2.4. Expand- 
ing the shifter beyond a range of 4 for this 
method provides a diminishing improvement, as 
shown in Table 2. 



Table 2 Average Quotient Bits per Cycle

Shifter Range (bits)    Average Speed (quotient bits per cycle)

2                       1.5
4                       2.39
6                       2.54
8                       2.64
Unlimited               2.66

Increasing the number of quotient bits developed
per cycle from 1.5 to 2.4 results in
increased speeds in the CFPA divide loop relative
to the MicroVAX FPU: 1.8 times greater for
90-ns cycles, and 2.0 times greater for 80-ns
cycles. The overhead cycles involved in setting
up the divide sequence and normalizing the quotient
are reduced from 7 to 2. As a result, the
CFPA requires fewer cycles for division than the
MicroVAX II FPU. Including the processor-to-
FPU interface cycles, the number of cycles for
single-precision division is reduced from 37 to
18 cycles (2.3 at 90 ns, 2.6 at 80 ns); for
D_floating double-precision division, 61 to 35
(1.94 at 90 ns, 2.2 at 80 ns).

Comparatively, this method of division is very 
efficient, especially when we consider the small 
amount of control circuitry and data path area 
required. Designers can increase performance 
additionally by using algorithms that employ 
multiples of the divisor, or by implementing a 
divider array structure. The use of multiples of
the divisor requires both additional registers to 
hold the multiples (3/4, 1, 3/2) and further 
expansion of the left shift capability to take 
advantage of the longer normalizations created by 
this approach (3.6 bits per cycle with left shift
range expanded to 6). In addition, the control
logic required to support the selection of the 
proper multiple is more complex and would be 
much more difficult to implement in the constrained
cycle time. The other alternative of executing
the divide step in an array structure for
performance capable of 3 to 4 quotient bits per 
cycle involves an even greater cost in hardware 
and is not consistent with the project goals. 

Integer division does not automatically benefit
from hardware devoted to floating point division.
Since floating point division relies on
the normalization of the operands, integer divi- 
sion must either convert operands to the normal- 
ized form or accept a slower one-bit-per-cycle 
algorithm. The CFPA design for integer division
normalizes both the divisor and dividend in 
order to use the 2.4-bit-per-cycle divide algo- 
rithm. Normalization of the divisor and dividend 
proceeds at 5 bits per cycle. The number 
of quotient bits needed to complete the integer 
division operation is determined by the differ- 
ence between the normalization shift amounts 
of the divisor and dividend. Consequently, 
integer divides are typically executed at
2.5 bits per cycle as compared to 1 bit per cycle
on the MicroVAX FPU. 
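
The relationship between normalization shifts and quotient length can be sketched as follows for unsigned 32-bit magnitudes. The extra bit added to the difference is an assumption that gives a safe upper bound; the paper states only that the count is determined by the difference. Function names are illustrative.

```python
WORD = 32  # VAX longword width


def normalization_shift(x, width=WORD):
    """Left shift needed to bring the leading one of x to the top bit."""
    return width - x.bit_length()


def quotient_bits_needed(dividend, divisor, width=WORD):
    """Upper bound on significant quotient bits, from the two shift amounts."""
    diff = normalization_shift(divisor, width) - normalization_shift(dividend, width)
    return max(0, diff + 1)


print(quotient_bits_needed(1000, 7), (1000 // 7).bit_length())
```

A small divisor and a large dividend give a long quotient, while operands of similar magnitude need only a bit or two, so the divide loop can stop early.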

Microcode Control Structure 

The control structure for the CFPA is influenced 
by two opposing constraints. The complicated 
requirements of instructions such as extended 
multiply and integerize (EMOD) and polynomial 
evaluation (POLY) require the flexibility offered 
by a microcoded approach. Performance goals,
however, require the speed of hardwired control 
structures to avoid costly delays incurred during 
microcode branch handling. The final implementation
combines a small control PLA (programmable
logic array) to provide the flexibility
of microcode control with hardware control 
structures for speed critical paths. These control 
structures are enabled through the microcode to 
emulate complete hardwired control for impor- 
tant instructions. The structures provide support 
for alignment, normalization, multiplication and 
division steps. Standard microcode control supports
the less critical instructions.

Functions are performed under more straight- 
forward microcode control when the code does 
not penalize the instruction performance. This 
trade-off simplifies critical circuitry in some 
instances. The only exception to this rule is in 
the handling of exception conditions. If an 
exception condition can be isolated from the nor- 
mal instruction flow, it is also processed in 
microcode rather than through the more expensive
hardware control.

The use of hardware structures reduces 
the total number of microcode terms needed 
to implement the instruction set. This reduc- 
tion is important to ensure that the microcode 
PLA can be implemented with an access time 
of one-half cycle. Instructions generally use
one code flow for all data types. In addition, 
similar instructions merge sections of flows to 
further minimize terms. For example, the add- 
compare-and-branch (ACB) instruction, which is 
one of the more complicated instructions imple- 
mented by the chip, required only three addi- 
tional terms beyond the addition and compare 
instruction flows. Despite this effort, almost
one third of the code was devoted exclusively 
to two instruction types, EMOD and POLY. 
By splitting, or "folding," the PLA into two
half-height interleaved arrays, the target speed 
was met with a penalty of only a few dupli- 
cated terms. In total, 76 VAX floating point 
as well as integer multiply and divide instructions
are implemented in the CFPA. In comparison
to the MicroVAX FPU, the total number of
microcode states was reduced by 20 percent, to 
only 159.

Microprogramming

As mentioned earlier, the use of hardware 
support contributes to improved performance 
for most instructions. However, since the CFPA 
cycle time during execution is very similar to 
that of the MicroVAX FPU (80 or 90 ns versus 
100 ns), we needed further improvement to meet
the project goals. Algorithmic improvement in
the convert-floating-to-integer (CVTFI) and
EMOD instructions provides between three and 
four times the performance of the MicroVAX FPU 
for the same instructions. But these gains would 
hardly translate to improved overall performance 
when considering the frequency of use for 
these instructions. Therefore, to reduce cycles 
for all instructions, we examined transitions 
during code entry and exit with internal processing.
Since the CFPA always receives the opcode
in advance of the operands, it is possible to 
reduce the execution time for all instructions 
by performing the first step of each operation 
repeatedly in anticipation of receiving the last 
operand. In this way, as soon as the interface 
recognizes that the operand is valid and the
control sequencer is able to act on that informa- 
tion, the first step of the instruction is already 
complete. 

In the CVAX system, as in the MicroVAX II sys- 
tem, floating point status must be returned before 
data can be received. One reason for this return 
of status is that it prepares the write path back to 
the general-purpose register file located on the 
CPU chip. Status conditions must be checked 
before the result register is written; the register 
update can thus be inhibited in the case of an 
error or exception condition. Latency was 
reduced on almost all instructions by transmit- 
ting the result status back to the CVAX CPU in the 
same cycle as the last step of execution. This is 
accomplished by checking the result prior to the 
last normalization or round operation in order to 
determine if the possibility of an exception condition
exists. Since F_floating and D_floating
formats use an exponent with a range of 256 val- 
ues, and G_floating format increases that range 
to 2,048 possible values, the exponent is in 
range for most results, and a no-exception status
can be returned prior to determination of the
final result. 
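
The early status check can be sketched as a simple range test. The margin `max_adjust`, standing for the largest exponent change the remaining normalization or round step could cause, is a hypothetical parameter, as is the function name. The sketch takes valid biased exponents to run from 1 up to the all-ones value, with 0 reserved, following the VAX formats.

```python
def can_signal_no_exception_early(biased_exponent, exponent_bits, max_adjust=2):
    """True if neither overflow nor underflow is possible after the last step.

    exponent_bits is 8 for F_floating and D_floating, 11 for G_floating.
    """
    lowest = 1
    highest = (1 << exponent_bits) - 1
    return lowest + max_adjust <= biased_exponent <= highest - max_adjust


print(can_signal_no_exception_early(128, 8))   # mid-range exponent
print(can_signal_no_exception_early(255, 8))   # at the edge of the range
```

Only results whose exponent falls within the margin of either end of the range need to wait for the final result before status is returned.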

CFPA Implementation 

After deciding on a set of basic algorithms that 
appeared to meet the project goals, the develop- 
ment effort proceeded to actual implementation. 
Individual algorithms can sometimes result in a 
proposed hardware solution that requires modifi- 
cations to either the hardware or to the algorithm 
in order to be implemented within design con- 
straints. Merging the requirements of several 
algorithms can create implementation conflicts 
throughout the physical design. Care must be 
taken to consider the opposing requirements 
while incorporating the necessary features in a 
single design. The algorithms for the CFPA were 
chosen with a single hardware microarchitecture 
in mind. That architecture evolved as the design 
progressed, but the architecture maintained the 
basic structure that was used as a framework for 
early circuit design and feasibility study. The fol- 
lowing section outlines the overall hardware 
microarchitecture for the CFPA. This section is 
followed by explanations of the more interesting 
circuit design issues. 

Microarchitecture

The CFPA contains two main functional units: 

■ The execution unit, which performs all arith- 
metic calculations 

■ The bus interface unit (BIU), which controls 
all I/O operations 

A block diagram of these units is shown in 
Figure 2.

The execution unit consists of two main data 
paths and their associated control logic. The 
65-bit fraction data path contains an integral 
multiplier array and also processes integer data. 
Also included in the fraction data path are a small 
4-bit left to 7-bit right shifter, a general-purpose 
ALU, scratch register, ROM constants, and quo- 
tient register and shifter. The second data path, 
the exponent data path, is 13 bits wide and contains
a modified ALU design used to calculate
absolute values needed in floating point addition. 
The exponent data path operates in parallel with 
the fraction data path and may be controlled inde- 
pendently or conditionally based upon results 
from the fraction data path. A 160-term PLA,
which accesses a single 44-bit microword each
cycle, controls the execution unit.

Figure 2 CFPA Block Diagram

The BIU controls the interface between the 
CPU and memory system. A 70-term PLA in the
unit controls all I/O transactions between the
CVAX and CFPA. The BIU also controls the test-
mode logic to allow visibility to the data paths 
and execution unit PLA during operation. 

Figure 3 illustrates the physical layout of these 
structures on the CFPA die.

Circuit Design

Clocking

The CFPA chip employs a four-phase overlapping 
clocking scheme which provides timing resolu- 
tion. Much of the control circuitry design calls 
for combinational circuits that operate between
latches clocked on nonconsecutive phases, which
are nonoverlapping.

Multiplier 

As noted in the section Multiplication, it was rec- 
ognized early in the chip design that the multi- 
plier array would be key to meeting the desired 
performance. The CFPA implements multiplication
by using an array of carry save adders with
partial product wraparound. The wraparound 
enables the array to be cycled as many times as 
necessary. The final carry and sum addition is 
executed in the fraction ALU. A static implemen- 
tation of the carry save adders is necessary since 
data propagates through multiple rows of the 
array.

To build the carry save adders, we used a four- 
transistor XOR. This approach allowed for minimum
delay and required the least amount of chip
area. As a result of SPICE simulation, we found
that doubling the minimum size of the transistors 
in the multiplier array could provide a 20 per- 
cent speed increase. Since the cell area was con- 
strained by the necessary interconnect in the 
metal layers, the device sizes were increased 
without affecting the cell size. Further device
size increases, however, would have forced us to 
increase the cell size and would not have 
improved speed appreciably due to increased 
self-loading. With the approach we chose, SPICE
simulation showed a worst-case delay of 6.5 ns
per row and a typical delay of 4.5 ns. 

To obtain the desired multiplication perfor- 
mance and minimize the area necessary for the 
multiplier array, we used a technique in which 
the array is cycled twice per microcycle. For
worst-case devices, a half cycle takes 45 ns. An
array size of four rows takes 26 ns to propagate 
through the array, allowing 19 ns for latching, 
return of partial products, and control switching. 
For typical devices four rows complete in 18 ns, 
allowing 22 ns in an 80-ns cycle for the 
wraparound path.
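
The timing budget in this paragraph follows from the per-row delays given above. A short sketch of the arithmetic, with an illustrative function name:

```python
def wraparound_slack_ns(cycle_ns, row_delay_ns, rows=4, passes_per_cycle=2):
    """Time left in each array pass after the rows have propagated."""
    pass_time = cycle_ns / passes_per_cycle
    return pass_time - rows * row_delay_ns


print(wraparound_slack_ns(90, 6.5))   # worst-case devices, 90-ns cycle
print(wraparound_slack_ns(80, 4.5))   # typical devices, 80-ns cycle
```

The slack must cover latching, the return of the partial products, and control switching, so adding a fifth row at 6.5 ns would cut the worst-case margin from 19 ns to 12.5 ns.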

Control PLA 

We also recognized the fraction shift control PLA 
as a possible speed limitation. The shift control 
PLA was the largest PLA in the control section and
had to evaluate in a single clock phase. Because
no clock signals were available to control evaluation
of the PLA, we used a "dummy" AND array
term to start evaluation of the OR array. A
"dummy" OR line controls output clocking, making
the PLA self-timed. Because this PLA could be
evaluated in a single clock phase, both alignment
and normalization operations were able to elimi- 
nate an unnecessary wait cycle present on the 
MicroVAX FPU. We were also able to expand the 
divide algorithm to 4-bit shifts per cycle.

As we had suspected, the limiting factor in the
final chip cycle time was the multiplier array.
The ALUs and the large control PLAs in both the 
microcode control section and the BIU easily met 
speed requirements in the CMOS I process. 

Design Methodology 

As VLSI technology improves, both chip area and 
density increase, allowing much larger and more 
complicated designs to be attempted. Critical to 
any large project, the ability to predict and adjust 
the design according to the most current information
plays an important role in achieving a successful
project outcome in a minimum of time.
This section describes the various phases and 
feedback paths of the design process for the CFPA 
and some of the unique aspects of VLSI design. 

In the first phase of design, we defined the 
major sections and the necessary global signals 
communicating between them. The major out- 
puts of this phase were hand-drawn sets of notes 
on the necessary functions of each section and 
preliminary sketches of possible implementa- 
tions. Early in the design, we recognized that cer- 
tain subsections would be critical to meeting the 
desired performance goals. These particularly
critical sections were:

■ The multiplier array 

■ The exponent input path 

■ The fraction shifter controls 

We therefore generated more detailed preliminary
designs for all of these sections. Moreover,
we tested their feasibility with SPICE circuit simulations.
The MSB and LSB logic in the multiplier
was also verified with an APL language simulation 
of the multiplier array. 

One of the hazards in the early stages of a pro- 
ject is the tendency to spend too much effort per- 
fecting one small piece of the design. If the origi- 
nal requirements are modified at a later date, 
much time is wasted. The design team, therefore, 
made a conscious effort to keep all parts of the 
design at similar levels of detail at all times 
throughout the project. 

For purposes of design checking and chip 
implementation, we divided the CFPA into seven 
major sections: fraction data path, fraction data 
path controls, exponent data path, exponent data
path controls, microsequencer, bus interface
unit, and clock generator. Consistent divisions 
and global signals between these major sections 
were maintained in both the behavioral and tran- 
sistor modeling levels as well as in the final mask 
artwork. This approach allows maximum possi- 
ble checking to be carried out on each section, 
independent of the state of other sections of 
the chip. 

Upon completion of the initial design concep- 
tion, a behavioral model was written in the 
DECSIM simulation language. This model helped 
us to refine the algorithms and further define the 
data path and control structures. We rewrote the 
model several times to improve detail and incor- 
porate design changes. From early in the develop- 
ment, the behavioral model was merged with 
the CVAX CPU chip model and a small system envi- 
ronment to provide a platform for more extensive 
testing. Existing diagnostic programs were therefore
able to be run on the model to provide early
checks on the design integrity. Additional tests 
were written to verify specific features of the 
CFPA implementation before we began the 
detailed circuit design for critical sections.
Throughout the development phase, we used the 
VAX Architectural Exerciser (AXE) extensively 
to test instruction compatibility with existing 
VAX implementations. Despite a degradation of 
approximately 1M:1 while using the simulator
to run test code, well over 500,000 test cases 
were run on the behavioral model before the 
design was considered ready for fabrication. 

Using the DECSIM MOS device simulation sys- 
tem, we created a transistor-level model from 
final schematics as they were completed. By col- 
lecting test patterns from the appropriate signals 
in the behavioral model, the team could begin to 
debug the schematic in complete sections as 
other sections were still being designed. To do 
this efficiently, the DECSIM group modified their 
simulator to allow designers to write a binary 
state file and reload the file for examination. This 
facility gave logic designers a very efficient means
to debug the transistor-level logic. Designers 
could run their simulations in batch mode overnight,
examine the resulting patterns for mismatches
with the behavioral model results, and
then "back up" to the area before the failure test 
point to find the underlying cause. They could 
perform all these steps without rerunning the 
entire simulation each time they wanted to go 
back in time to look at another signal.

As each section of the transistor-level schematic
was developed to a satisfactory level of
accuracy, the third phase of the design — 
creation of the physical layout artwork — began 
on that section. To create the artwork, a Calma
GDS interactive editing system was used. Over 
the course of the project, three layout designers 
were employed full time. Toward the end of the 
layout phase, up to four additional designers 
were working on various parts of the chip. Each
section was checked with the interconnect 
verification (IV) wirelist extraction tool and a 
design rule checker (DRC) program.

As all the sections were drawn and global inter- 
connect wiring was added to the chip layout, the 
fourth phase of the design — the back end 
checks — began. The IV program was used to 
extract actual capacitance values for all nodes on 
the chip. We used these capacitance values in 
two ways to check the design. First, they were
compiled into the DECSIM MOS transistor-level 
simulator. The timing feature of this tool was 
used to quickly check for gross timing problems 
over the entire chip operating as a whole. Once 
we identified an area as having a possible timing 
problem and for those areas where we believed 
the DECSIM MOS simulation was inaccurate, we 
created and ran SPICE circuit simulations. In a 
second use of the extracted capacitance values, a 
program called PATH was written in the SCAN 
compiler generator language. PATH allowed the 
circuit designers to easily and accurately create 
wirelists representing critical paths for submis- 
sion to SPICE. The program extracts a circuit 
path description from the much larger wirelists 
generated from either the IV tool or the chip- 
wide schematics. Wirelists created by the IV pro- 
gram include interconnect and capacitance infor- 
mation directly from layout artwork. 

Although the chip design process appears in 
this discussion to be a neat progression, the 
various aspects of the actual project quickly over- 
lapped one another. Almost all phases were tak- 
ing place simultaneously on the various sections 
of the chip. To keep track of all these activities
and continually update the project completion
date, we used a spreadsheet program as a tracking
tool.

The design team of 11 people completed the
project in 21 months, including 6 months for
product conception and 15 months for implementation.
Due to the extensive modeling and
simulation prior to device fabrication, initial
parts were functional at speed.

Test Features

To aid the debugging process and provide more 
complete test coverage, the BIU contains test 
logic. This logic allows visibility to both data 
paths or to the main PLA. A simple test load 
sequence allows one of 16 possible test modes to 
be selected. Various groups of internal data path
and control bits and two test-drive timing options 
are allowed. The test mode can be enabled or dis- 
abled at any time by asserting a single test pin. 
Certain test modes are available while operating 
at full speed in a system configuration.

CFPA Performance 

Although there is no absolute measure of performance
in computer system design, the floating
point performance of the CVAX system is approximately
three times that of the MicroVAX II system. On some of
the more widely publicized benchmarks of
floating point system performance, the CVAX system
with CFPA running at 25 MHz shows better
than three times the speed of the MicroVAX II
with FPU. The system calculates 3,105K single-precision
Whetstone instructions per second and
1,996K double-precision Whetstone instructions
per second. Linpack performance of 0.68 Mflops
single-precision and 0.45 Mflops double-precision
demonstrates over four times the performance
of the previous generation MicroVAX
implementation.

Table 3 lists the typical cycle counts for regis- 
ter-to-register execution of floating point addi- 
tion, subtraction, multiplication, and division. 



Table 3 CFPA Cycle Counts for Optimized Instructions

               CFPA      Opcode/Operand    Total
Instruction    Cycles    Transfers         Cycles

ADDF/SUBF        3           5                8
MULF             4           5                9
DIVF            13           5               18
ADDD/SUBD        4           7               11
MULD             6           7               13
DIVD            27           7               34
ADDG/SUBG        4           8               12
MULG             6           8               14
DIVG            26           8               34

The System Support Chip,
a Multifunction Chip for 
CVAX Systems 

Developed as a general-purpose companion to the new CMOS VAX VLSI 
chips, the System Support Chip (SSC) contains a common core of peripheral
system functions that are required to support a MicroVAX system
environment. These functions include timers, VAX console support, and 
standby RAM. In addition, the SSC provides system designers with "hooks" 
to other system functions. With these peripheral functions integrated on a 
single chip, system designers can substantially reduce the number of com- 
ponents on a module and add features previously not considered cost 
effective. Primarily used with the CVAX CPU chip, the SSC is also compat- 
ible with the NMOS MicroVAX CPU chip. 



Background and Goals 

In 1984, as the VAX 8200 and MicroVAX II chip 
sets entered production, Digital's Semiconductor 
Engineering Group (SEG) directed its attention 
toward defining the next generation of MicroVAX 
systems.1 This paper describes the project
history and functionality of one of this new generation's
peripheral chips, the MicroVAX System
Support Chip (SSC). Developed over a period of
18 months beginning in late 1984, the SSC was
designed as a general-purpose companion to the
CVAX CPU. As such, the chip is used in the VAX
6200 family and in the MicroVAX 3000 family.2

As part of the definition of the new CMOS VAX 
family of VLSI chips, SEG looked at the periph- 
eral functions that surrounded the existing 
MicroVAX II CPU. We observed that, to build a 
marketable product, each system group had 
added a collection of timers, decoders, and other 
low- and mid-complexity functions to their 
respective modules. A high level of similarity 
from module to module was apparent in the 
makeup of these functions. 

In addition to examining these existing mod- 
ules, we talked with the system designers to learn 
what additional functions should be included on 
the next generation of systems. Again, we found 
that the various systems under development 
would have a significant number of overlapping 
functional requirements. 



We decided a chip that provided the common 
core of these peripheral functions would be a 
strategic component for Digital products. This 
single chip would integrate many of the periph- 
eral functions usually required on MicroVAX CPU 
modules. Consequently, a system designer could 
substantially reduce the number of components 
on a CPU module and add features that previ- 
ously would not have been cost effective. More- 
over, the chip would allow him to add features 
without lengthening the project schedule or 
requiring extra resources. As a result, the system
designer could produce a more competitive 
Digital product at little additional cost. 

From the system designer's viewpoint, the chip 
would 

■ Fully implement many functions used identi- 
cally across different MicroVAX systems, such 
as timers, ROM support, and standby RAM 

■ Provide the "hooks" to support other func- 
tions that would be implemented differently in 
the different system environments 

Thus each system group would no longer need to 
design, implement, and debug these important 
peripheral functions from scratch. Instead, they 
could use a readily available part that had been 
debugged and qualified. Further, since the SSC 
would use custom CMOS VLSI, this chip would 
contain some additional useful functions, such as
general-purpose timers, that are expensive to
implement in off-the-shelf or gate array tech- 
nology. 

With these goals outlined, we began develop- 
ment of the SSC. The following section presents 
an overview of the chip. In the balance of the 
paper, we describe the chip functions in detail 
and discuss the trade-offs made and problems 
encountered during development. 

SSC Overview 

The SSC incorporates onto a single chip a com- 
mon core of functions required to support the 
VAX system environment. Table 1 lists the essential
physical characteristics of the chip. Figure 1,
a photograph of the chip, shows the major sec- 
tions. Grouped into three main categories, these 
sections are 

■ Support for power-up booting and the VAX 
console 

■ Clock and timing functions 

■ Features required by the VMS operating system 
and those commonly required on a VAX CPU 
module. 

We begin our detailed discussions of the chip 
functions with the SSC console and boot code 
support. 

Console and Boot Code Support 

The peripheral support described in this section 
includes ROM packing, halt-protection, the 
UARTs, and standby RAM. 

ROM Packing 

When a MicroVAX CPU is powered up, it begins 
executing code from read-only memory (ROM). 
To properly communicate with an off-the- 
shelf ROM, the microprocessor requires addi- 
tional interfacing logic. The SSC provides this 
logic by generating the signals needed for the 
ROM-to-microprocessor interface. The SSC also 
provides the packing support for data-width 
compatibility between the ROM and the micro- 
processor. 

At project outset, SSC designers assumed the
module designers would use four ROMs in parallel
to provide a 32-bit-wide ROM word to the
CPU. However, with ROMs becoming denser
every year, it is now possible to put all boot, console,
and diagnostic code in one or two 8-bit-wide
ROMs. System designers therefore chose to use
fewer ROMs, decreasing the number of components
on the module and thus the product cost.
The MicroVAX 3000 uses two 64 kilobit (Kb)
ROMs in parallel, forming a 16-bit ROM word.
The VAX 6200 system uses two 64Kb ROMs in
series.

Table 1 SSC Physical Characteristics

Total device count    84,000 (approx.)
Die size              8.0 mm x 7.5 mm
Power dissipation     Less than 1.0 W, worst case
Packaging             84-pin surface mount
Clock                 40 MHz external; 20 MHz internal;
                      25.6 kHz for time-of-year clock

To provide data-width compatibility between 
the 32-bit-wide CVAX bus and the narrower ROMs, 
the SSC includes packing support for 16-bit
word-wide or 8-bit byte-wide external ROM. 
With packing support, the SSC performs multiple 
reads of the narrow ROM word, assembles a 
32-bit longword, and sends the longword back to 
the microprocessor. The SSC performs this func- 
tion by directly driving the output enable and 
address lines 1 and 0 of the ROM. (See Figure 2.) 
The ROM's other address pins are driven by an 
external address latch, and the data lines of the 
ROM drive the CVAX bus directly. 

To pack a ROM, the SSC asserts output enable, 
drives the appropriate combinations of ROM 
address pins 1 and 0, and receives the narrow 
data across the CVAX bus in consecutive ROM 
access cycles (unbeknownst to the microprocessor).
The SSC then deasserts output enable, puts
the packed longword on the CVAX bus, and com- 
pletes the read transaction. 
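
The packing sequence described above can be illustrated with a short Python sketch. This is a minimal model under stated assumptions, not the SSC's actual logic: the function names are hypothetical, and the ordering of the narrow reads (lowest ROM address placed in the least significant bits, following VAX byte order) is assumed rather than taken from the chip specification.

```python
def reads_per_longword(width_bits: int) -> int:
    """Number of narrow ROM reads needed to build one 32-bit longword."""
    if width_bits not in (8, 16):
        raise ValueError("packing is modeled only for 8-bit and 16-bit ROMs")
    return 32 // width_bits


def pack_longword(rom: bytes, longword_addr: int, width_bits: int) -> int:
    """Assemble a 32-bit longword from consecutive narrow ROM reads.

    rom           -- ROM contents, one byte per byte address
    longword_addr -- byte address of the longword (multiple of 4)
    width_bits    -- ROM data width, 8 or 16
    """
    if longword_addr % 4 != 0:
        raise ValueError("longword address must be 4-byte aligned")
    step = 32 // reads_per_longword(width_bits) // 8  # bytes per narrow read
    longword = 0
    # Step the two low-order address bits through the longword.
    for low_addr in range(0, 4, step):
        start = longword_addr + low_addr
        narrow = int.from_bytes(rom[start:start + step], "little")
        # Assumed ordering: lower addresses land in less significant bits.
        longword |= narrow << (8 * low_addr)
    return longword


rom_image = bytes([0x78, 0x56, 0x34, 0x12])
print(hex(pack_longword(rom_image, 0, 8)))   # four byte-wide reads
print(hex(pack_longword(rom_image, 0, 16)))  # two word-wide reads
```

In this model the microprocessor sees the same longword regardless of ROM width; only the number of hidden ROM access cycles changes.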

CPU Halt-request Protection 
System designers requested that the SSC help 
prevent an undesired condition in the halt logic. 
When the halt pin is asserted on the micro- 
processor, it executes a special trap to console 
code stored in the ROM. A second assertion of 
the CPU's halt pin (typically generated when 
someone repeatedly presses the halt button on 
the system front panel) causes a second such 
trap, overwriting the pointer needed to return 
to program code upon leaving console mode. 
Without this pointer, normal operation of the 
machine cannot be resumed without booting. 
Obviously, system designers wanted to prevent
this condition.

Figure 1 SSC Photograph Showing Major Sections

Figure 2 SSC ROM Packing Connection Diagram



The SSC prevents the second call by monitoring 
the addresses of all instruction reads and by inter- 
cepting all external halt requests made to the 
CPU. During normal CPU operation, the SSC 
passes an initial halt request to the microproces- 
sor. The microprocessor immediately begins to 
execute from halt-protected space, which is a 
special address space programmed into the SSC 
by the user at boot time. 

When the CPU reads the first instruction from 
console code, the SSC detects this console code 
address and masks further halt requests. These
requests are masked as long as the microprocessor
is executing ROM console code. The console
code can then run uninterrupted by halts.
During console code execution, the SSC continues
to monitor all instruction addresses. When
an address outside halt-protected space is
detected, the SSC re-enables halt requests to
the CPU.
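
The masking behavior can be summarized as a small state machine. The Python sketch below is illustrative only: the class name, method names, and the example address range are hypothetical, and the halt-protected space is modeled as a single half-open address range.

```python
class HaltProtector:
    """Toy model of halt-request masking based on instruction addresses."""

    def __init__(self, protected_start: int, protected_end: int) -> None:
        # Halt-protected space, programmed at boot time (half-open range).
        self.protected_start = protected_start
        self.protected_end = protected_end
        self.halts_masked = False

    def instruction_read(self, address: int) -> None:
        """Track every instruction fetch; mask halts inside protected space."""
        self.halts_masked = self.protected_start <= address < self.protected_end

    def halt_request(self) -> bool:
        """Return True if an external halt request is passed on to the CPU."""
        return not self.halts_masked


# Hypothetical address range chosen only for the example.
ssc = HaltProtector(0x2004_0000, 0x2006_0000)
print(ssc.halt_request())          # first halt request is passed through
ssc.instruction_read(0x2004_0000)  # CPU begins executing console code
print(ssc.halt_request())          # a repeated halt request is masked
ssc.instruction_read(0x0000_1000)  # CPU leaves halt-protected space
print(ssc.halt_request())          # halt requests are enabled again
```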

Before deciding on the design described above, 
we considered implementing a software-con- 
trolled bit that would enable and disable halts. 
This scheme would require the software to set 
the bit upon entering halt-protected space and to 
clear the bit upon re-entering normal operation. 
Although apparently simpler, this scheme proved
to be flawed because two conditions might occur
that would prevent the user from halting the system:
(1) the bit could be accidentally set by nonboot
code, or (2) a software error in the boot
code could cause the microprocessor to start executing
nonsystem code.

With the plan we chose, control is automatically
returned to the user as soon as the software
completes execution of the presumably bug-free
halt-protected boot code. The system designer
can, however, provide software control of the 
halt-enable function by aliasing the boot ROM 
into two adjacent spaces, where only one copy is 
halt protected. The software can then control 
halts by jumping between copies of the code. 
(This method is used on the MicroVAX II and 
MicroVAX 3500/3600 systems.) 

UARTs 

Although it was clear from the beginning that the 
SSC should provide UARTs, the best choice for 
number and design was not immediately clear. 
We had two choices at the time the chip was 
defined. 

■ Double-buffered DEC DLARTs (DC-319), 
which were in wide use, although a few 
problems with this design had recently 
surfaced 

■ Silo designs, which were becoming popular, 
though large in size 

To conserve chip area, the SSC team settled on 
a design very similar to the DEC DLART design, 
making a few improvements in response to user 
requests. To keep from unduly complicating the
design, we also decided to limit the number of
UARTs to two (the number supported as console
ports within the VAX architecture). As a further
simplification, we limited the number of baud
rates to eight power-of-two choices (300 to
38,400 baud).
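
The eight rates follow from doubling 300 baud seven times, as the following Python sketch shows. The 3-bit selection code used to index the table is a hypothetical encoding introduced for illustration; the actual register encoding is not given here.

```python
BASE_BAUD = 300

# Eight choices, each double the previous one: 300 * 2**code.
BAUD_RATES = [BASE_BAUD << code for code in range(8)]


def baud_for_code(code: int) -> int:
    """Map a hypothetical 3-bit selection code to a baud rate."""
    if not 0 <= code < len(BAUD_RATES):
        raise ValueError("selection code must be in the range 0..7")
    return BAUD_RATES[code]


print(BAUD_RATES)
```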

Our most significant improvement to the
DLART design was the addition of hardware
control-P break detection. Control-P entered on a
VAX console is interpreted as a halt request.
Thus, the UART must pick out this special
keystroke from the normal character stream and
then signal the CPU to take appropriate action.
Formerly, this function was performed by cumbersome
firmware. However, the SSC hardware
continuously watches for this character and,
when it senses control-P, automatically signals
the microprocessor.

The console code may configure the SSC 
such that a break is defined as a control-P or as 
20 spaces; the latter is a definition still used in 
some console applications. At one point, we had 
planned to use the chip timebase to define a 
break as a space lasting a fixed number of milliseconds
instead of 20 spaces. However, users
advised us that this new idea, although more 
elegant, would make the UART more confusing 
to use. 

Other improvements include better notifica- 
tion of overrun and framing errors, and secure 
console support. Console security is effected by a 
pin. When grounded, the pin prevents a break
from halting the CPU. This pin is typically con- 
nected to a key switch on the computer's front 
panel. Using the switch, the user can lock out 
console-induced halts. 
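
The break-detection and console-security behavior described above can be sketched in Python as follows. This is a rough model, not the SSC implementation. Control-P is ASCII code 0x10. Reading "20 spaces" as 20 consecutive bit times with the serial line at the space (logic 0) level is an interpretation, and all function names are hypothetical.

```python
from typing import Iterable

CONTROL_P = 0x10  # ASCII code produced by control-P


def control_p_break(received: bytes) -> bool:
    """Break defined as a control-P anywhere in the character stream."""
    return CONTROL_P in received


def space_break(line_samples: Iterable[int], run_length: int = 20) -> bool:
    """Break defined as run_length consecutive space (0) bit times.

    line_samples holds one sample of the receive line per bit time.
    """
    run = 0
    for level in line_samples:
        run = run + 1 if level == 0 else 0
        if run >= run_length:
            return True
    return False


def halt_signaled(break_seen: bool, secure_pin_grounded: bool) -> bool:
    """A grounded security pin prevents a break from halting the CPU."""
    return break_seen and not secure_pin_grounded


print(halt_signaled(control_p_break(b"SHOW\x10"), secure_pin_grounded=False))
print(halt_signaled(control_p_break(b"SHOW\x10"), secure_pin_grounded=True))
```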

Further, the SSC allows the CPU to directly 
access the UARTs, time-of-year clock, and bus 
reset register by means of the VAX external pro- 
cessor register protocol. Using this protocol, the 
microprocessor can address system registers 
located outside the microprocessor by register 
number rather than by complete address. The 
SSC understands this protocol and is capable of 
decoding the register number and generating 
the desired response. Previously, VAX module 
designers using off-the-shelf UARTs had to imple- 
ment a substantial amount of external logic to 
decode the register addresses and enable the 
UARTs to respond to this protocol. 

Finally, the UARTs support break transmit and
loopback, and properly respond to VAX interrupts.
In products containing the SSC, one UART
is used as the system console; the other is used
for auxiliary functions, such as remote diagnostics,
or is disabled.

Standby RAM

When a VAX system is powered off, the operating 
system must store some information in non- 
volatile memory until the system is powered up 
again. This stored information describes the sys- 
tem configuration and contains pointers to restart 
data stored on the disk. On the MicroVAX II CPU
module, a watch chip provided 50 bytes of storage
for this purpose. System designers indicated
this amount was inadequate; 500 to 1000 bytes
was desirable. 

To meet this standby storage need, the SSC pro- 
vides 1 kilobyte (KB) of battery backed-up ran- 
dom-access memory (RAM), organized as 256 by 
32 bits. This RAM is also used as a system 
"scratch pad" during power-up test. 

Additional standby support features are de- 
scribed in the section Standby Features. 

Timers 

The SSC timers serve to improve system reliabil- 
ity, meet architecture requirements, and save 
module space. These timers include the pro- 
grammable bus timeout, the interval timer, gen- 
eral-purpose timers, and the time-of-year clock 
discussed in this section. 

Bus Timeout 

Since the CVAX bus is a handshake bus, incom- 
plete bus transactions can hang the system. Some 
older VAX systems permit this condition; when 
those systems were designed, the high cost of 
implementing a timeout in external logic could 
not be justified in relation to the rarity of this 
event. However, the SSC improves system reli- 
ability by providing a programmable bus timeout 
at no additional system cost. 

If a transaction lasts longer than a user- 
programmed interval, the chip 

■ Signals the microprocessor that a bus error has 
occurred 

■ Terminates the transaction 

■ Sets certain internal status flags based on the
type of transaction that timed out 

The status flags differentiate the two types of
timeouts: (1) unexpected timeouts of read or
write transactions, and (2) permissible timeouts
caused by some unimplemented external processor
registers or by certain interrupt-acknowledge
transactions. After the timed-out transaction is
terminated, error-handling code reads the SSC
internal status flags and takes the appropriate
action.

The timeout interval may be programmed in
1-microsecond increments up to 16 seconds.
The larger values are used to time out system
self-test.
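
A simplified model of the timeout decision is sketched below in Python. The type and function names are hypothetical, and the model treats every external processor register or interrupt-acknowledge timeout as permissible, whereas the text above limits this to some registers and certain transactions.

```python
from enum import Enum
from typing import NamedTuple, Optional


class Kind(Enum):
    READ = "read"
    WRITE = "write"
    PROCESSOR_REGISTER = "external processor register"
    INTERRUPT_ACK = "interrupt acknowledge"


MAX_TIMEOUT_US = 16_000_000  # "up to 16 seconds", in 1-microsecond steps


class TimeoutStatus(NamedTuple):
    bus_error: bool    # signaled to the microprocessor
    permissible: bool  # False means an unexpected read/write timeout


def check_transaction(kind: Kind, elapsed_us: int,
                      timeout_us: int) -> Optional[TimeoutStatus]:
    """Return status flags if the transaction timed out, else None."""
    if not 1 <= timeout_us <= MAX_TIMEOUT_US:
        raise ValueError("timeout must be between 1 us and 16 s")
    if elapsed_us <= timeout_us:
        return None  # transaction completed within the programmed interval
    permissible = kind in (Kind.PROCESSOR_REGISTER, Kind.INTERRUPT_ACK)
    return TimeoutStatus(bus_error=True, permissible=permissible)


print(check_transaction(Kind.READ, elapsed_us=50, timeout_us=20))
print(check_transaction(Kind.INTERRUPT_ACK, elapsed_us=50, timeout_us=20))
```

Error-handling code would inspect the returned flags to decide whether the timeout indicates a fault or an expected condition.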

Interval Timer 

The VAX architecture specifies a complete inter- 
val clock which the operating system uses to 
schedule time-critical system functions at regular 
intervals. On MicroVAX CPUs, logic for the clock 
is simplified to reduce the amount of circuitry on 
the microprocessor chip. On these microproces- 
sors, only an interrupt-enable bit is imple- 
mented. The timer source is generated externally 
and is driven onto an input pin of the microprocessor
chip. When the interrupt-enable bit is set,
an interrupt request is generated on the falling
edge of the timer source, which is a 100-Hz signal
on MicroVAX systems.

The SSC eliminates the need for the module 
designer to place another oscillator on the CPU 
module by providing a 100-Hz output suitable
for driving the interval timer input to the 
microprocessor chip. 

General-purpose Timers 
Early in the SSC development, many potential 
users voiced a need for general-purpose timers on
future MicroVAX modules. However, no one had
specific recommendations on how such functionality
should be implemented. Some users
requested four timers, whereas others reasoned
that one timer supported with software could do
the work of four or eight timers.

After some design attempts, we decided to 
copy, bit for bit, the VAX standard interval 
clock. We reasoned that it was prudent to select 
a design that was already well thought out and 
in general use. We did add one control bit to 
provide a one-shot capability. Our decision to 
include two timers was based on the amount 
of available chip area and a desire for some 
redundancy. 

Each timer provides scheduled interrupts
with 1-microsecond resolution. The maximum
interval between interrupts is 1.2 hours. In one-shot
mode, the timer stops upon generating
its first interrupt. In single-step mode, a count
can be caused only by writing to a specific
control bit. The interrupt vector is user-programmable.

These timers are not used by the CPU module,
but are available to the end user. We expect them 
to be very helpful to users designing embedded, 
time-sensitive applications. 
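
The 1.2-hour figure is consistent with a 32-bit counter that advances once per microsecond, which is the assumption made in the Python sketch below. The function names are hypothetical, and the second function models only the difference between repeating and one-shot operation.

```python
from typing import List

COUNTER_BITS = 32  # assumed counter width
TICK_US = 1        # 1-microsecond resolution


def max_interval_hours() -> float:
    """Longest programmable interval for the assumed counter width."""
    return (2 ** COUNTER_BITS) * TICK_US / 1_000_000 / 3600


def interrupt_times(interval_us: int, run_for_us: int,
                    one_shot: bool = False) -> List[int]:
    """Times (in microseconds) at which the timer would interrupt."""
    if interval_us <= 0:
        raise ValueError("interval must be positive")
    times = []
    t = interval_us
    while t <= run_for_us:
        times.append(t)
        if one_shot:
            break  # one-shot mode: stop after the first interrupt
        t += interval_us
    return times


print(round(max_interval_hours(), 2))
print(interrupt_times(250, 1000))
print(interrupt_times(250, 1000, one_shot=True))
```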

Time-of-year Clock 

The VAX architecture requires a battery-backed- 
up time-of-year clock with a resolution of 
10 milliseconds (ms). When the MicroVAX II 
CPU module was designed, the best method for 
providing this feature involved the use of a BCD 
watch chip, approximately one-half gate array of 
logic to interface the chip to the MicroVAX bus, 
and some specially written operating system 
code. Even then the clock provided a resolution 
of only 1 second in standby mode.

The SSC provides a much more desirable solution.
Its 32-bit VAX standard time-of-year clock,
driven by an external 25.6 kilohertz (kHz) oscillator,
increments every 10 ms. As with all SSC-internal
registers, the microprocessor can access
the time-of-year clock without using any external 
logic. 
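
The relationship between the oscillator frequency and the 10-ms tick can be checked with a few lines of Python. The sketch below is arithmetic only; the constant names are hypothetical, and the wrap-around estimate simply follows from a 32-bit count of 10-ms ticks.

```python
OSCILLATOR_HZ = 25_600  # external time-of-year oscillator
TICK_MS = 10            # required clock resolution
COUNTER_BITS = 32       # width of the time-of-year register

# Oscillator cycles per clock increment: 25,600 Hz * 0.010 s.
cycles_per_tick = OSCILLATOR_HZ * TICK_MS // 1000

# Time for a 32-bit count of 10-ms ticks to wrap around, in days.
wrap_days = (2 ** COUNTER_BITS) * TICK_MS / 1000 / 86_400

print(cycles_per_tick)
print(round(wrap_days, 1))
```

Because 256 is a power of two, the 10-ms tick can be derived from the oscillator with a simple 8-bit binary divider, which is one plausible reason for the unusual 25.6-kHz frequency.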

To further minimize cost and module space
usage in systems where battery-backed-up clock
operation is not required, the user may simply
ground the 25.6-kHz input pin on the SSC.
During normal operation, the time-of-year clock
will then automatically derive its timebase from the
chip's UART timebase, removing the need for the
25.6-kHz oscillator on the module.

Other Support Features 

Programmable Address Strobes 
As noted in the section Background and Goals, 
the SSC is designed to provide system designers
with "hooks" to other system functions. One of
these hooks is the SSC programmable address 
decode strobe function, which adds user 
flexibility and also saves module space. 

Virtually every CPU module needs logic that 
watches the bus for particular addresses and 
asserts signals when these addresses are sensed.
This function is typically embedded in gate array 
logic or in dedicated programmable array logic 
(PAL) chips. 

The SSC has two programmable address decode 
strobes. The user may program each strobe for a 
particular address of 1s, 0s, or "don't cares." The
user can also program selectively for read or
write transactions. When a strobe channel is
enabled, the corresponding output pin will assert
during any bus transaction for which the programmed
address and transaction type are
matched.
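
Address matching with "don't care" bits is commonly expressed as a compare value plus a mask. The Python sketch below illustrates that general technique; the class and field names are hypothetical, and the way the SSC actually encodes the pattern in its registers is not described here.

```python
from dataclasses import dataclass


@dataclass
class AddressStrobe:
    match: int        # required bit values (1s and 0s)
    care_mask: int    # 1 = compare this bit, 0 = "don't care"
    on_read: bool     # respond to read transactions
    on_write: bool    # respond to write transactions
    enabled: bool = True

    def asserts(self, address: int, is_write: bool) -> bool:
        """Return True if the strobe pin asserts for this transaction."""
        if not self.enabled:
            return False
        if is_write and not self.on_write:
            return False
        if not is_write and not self.on_read:
            return False
        # Only bits selected by care_mask take part in the comparison.
        return ((address ^ self.match) & self.care_mask) == 0


# Hypothetical example: respond to reads of any address in a 256-byte block.
strobe = AddressStrobe(match=0x2008_4000, care_mask=0xFFFF_FF00,
                       on_read=True, on_write=False)
print(strobe.asserts(0x2008_4010, is_write=False))
print(strobe.asserts(0x2008_4010, is_write=True))
print(strobe.asserts(0x2008_4100, is_write=False))
```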

The strobes can be programmed either to 
provide a hook for external logic or to complete a 
transaction after a delay. When the SSC is pro- 
grammed to provide a hook, the strobe might 
be used to drive an external address decoder or to 
enable another chip. After asserting the output 
strobe, the SSC takes no further action, permit- 
ting another device to complete the bus trans- 
action. 

Alternatively, a strobe can be programmed to 
complete the transaction after a delay that per- 
mits an external device several hundred nanosec- 
onds to respond. When configured in this way, 
the strobe is usually programmed to respond to 
reads of a single longword address. The strobe is 
then wired to enable three-state drivers which 
drive module data onto the CVAX bus. This data is 
often made up of external registers, or of dual in- 
line package switches that indicate baud rate 
selection and other module-specific information. 

Output Port 

Four pins on the chip function as an output port. 
The port is written as a register and is capable of 
driving simple output devices. This output port 
is another general-purpose feature that system
designers need to implement various module-specific
functions. Some designers use the port
pins to drive LEDs, which are then flashed in a
particular sequence to indicate progress of self-test.
In other applications, system designers have
used these signals to control external multiplexers
and to provide simple modem control.

Bus Reset 

The VAX architecture requires a reset of the I/O 
system when the CPU issues a write to a particu- 
lar external processor register. This specification 
requires support from both decoding logic and 
I/O system reset logic. In the past, each module 
designer had to implement both logic blocks in 
external hardware. SSC designers saw another 
opportunity to simplify the CPU module by plac- 
ing some of the consistently required logic on 
the SSC.

Although the I/O system reset logic varies 
among systems, the decoding logic is the same in 
each MicroVAX system. The SSC provides this 
core logic, taking three actions. First, the chip 
decodes the external processor register number. 
Then it asserts an output pin in response to the 
external processor register write. Finally, it
delays the completion of the write transaction for
several hundred nanoseconds, so that module- 
specific logic, triggered by the pin assertion, may 
proceed to take the proper action to complete 
the I/O system reset. 

Standby Mode: Power-sensing Features
When powered down, VAX systems are required 
to maintain a running real-time clock for at least 
100 hours. Retention of some memory is also 
desirable. As noted in the section Standby RAM,
the SSC satisfies these requirements by providing 
a standby operating mode. In this mode, the 
power supply to the module and to the chip pad 
drivers is turned off and most internal logic is dis- 
abled. However, the SSC RAM and time-of-year 
clock are powered by three NiCad batteries sup- 
plying between +3.1 V and +4.5 V at approxi- 
mately 150 microamperes. The batteries also 
power the 25.6-kHz external low-power CMOS 
oscillator, which provides the time-of-year clock 
timebase. Within the SSC, special logic guaran- 
tees smooth transitions from normal operation to 
standby mode. 

As part of providing standby operation, the SSC 
must reliably report at boot time whether standby 
power was continuously maintained during the 
standby period. The task of determining whether 
battery power had remained stable during the
standby period was a difficult challenge for the 
SSC designers. There are two ways power can be 
lost during standby: The batteries may run down, 
or someone may replace the batteries. In either 
case, the SSC detects loss of power and reports 
such loss to the CPU during the next boot. 
Except for external logic used for voltage mea- 
surement, this entire function is implemented 
within the SSC as follows. 

When the batteries run down, the unaccept- 
ably low voltage can be detected during boot. 
However, our CMOS process is not optimized for 
the design of logic that can accurately measure 
intermediate voltages. Thus, external circuits 
are used to detect whether battery voltage is 
currently below a minimum level. If voltage is 
below minimum, these circuits assert an SSC 
input pin dedicated to this function. However, 
these external circuits cannot detect temporary 
power losses that occur during standby mode, for 
example, when the batteries are replaced. To 
provide for these cases, a special latch on the 
chip, which powers up in a preferred state, 
detects the interruption of battery power during 
standby or initial power-up. This power-up 



detector latch will operate for arbitrarily slow 
supply transitions. In addition, the latch's reset 
input includes internal filtering for protection 
against fast supply transitions or power-up noise. 

If either the external circuits assert the SSC 
input pin or the special power-up latch indicates 
a loss of power, the SSC sets an internal flag bit at 
boot time. The bit, which indicates that the clock 
and RAM are not valid, is read by the micropro- 
cessor during boot. 
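
The boot-time decision reduces to a logical OR of the two power-loss indications. The Python sketch below restates it; the function and parameter names are hypothetical.

```python
def standby_data_invalid(low_battery_pin: bool,
                         power_up_latch_tripped: bool) -> bool:
    """Flag read at boot: True means the clock and RAM are not valid.

    low_battery_pin        -- external circuit reports voltage below minimum
    power_up_latch_tripped -- on-chip latch saw battery power interrupted
    """
    return low_battery_pin or power_up_latch_tripped


# Batteries replaced during standby: voltage is fine now, but the latch
# recorded the interruption, so the stored data must not be trusted.
print(standby_data_invalid(low_battery_pin=False, power_up_latch_tripped=True))
```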

System reliability is improved by the SSC's abil- 
ity to determine the integrity of its standby logic 
and to notify the CPU in a software-accessible 
fashion. Moreover, this feature saves design time,
since designers need not individually create this
tricky but necessary logic.

Flexible Addressing 

The designers of the SSC determined that the chip 
should fit into any VAX system environment with 
a minimum of external address decoding or sys- 
tem incompatibility. As a result, the SSC control 
and status registers and internal RAM are all situ- 
ated within a relocatable 2KB address space. This 
arrangement eliminates the need for an external 
chip-enable pin and the external decoding logic 
that would be needed to properly assert such a 
pin. The power-up boot code programs the base 
address of the registers by writing a 2KB-aligned 
value to the SSC base address register. 

The SSC base address register is located at a sin- 
gle fixed address, chosen in cooperation with our 
major users. The SSC RAM and registers can then 
be addressed by adding their specified offsets to 
the value in the base address register. A system 
designer can therefore situate the SSC registers 
and RAM (together) anywhere in a system's I/O 
space map. 
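
The base-plus-offset addressing can be illustrated with the following Python sketch. The function names, the example base address, and the example offset are hypothetical; only the 2KB window size and the 2KB alignment requirement come from the description above.

```python
WINDOW_BYTES = 2 * 1024  # relocatable 2KB space for registers and RAM


def validate_base(base: int) -> int:
    """Check that a proposed base address is 2KB-aligned."""
    if base % WINDOW_BYTES != 0:
        raise ValueError("base address must be 2KB-aligned")
    return base


def register_address(base: int, offset: int) -> int:
    """Address of an SSC register or RAM location at a given offset."""
    if not 0 <= offset < WINDOW_BYTES:
        raise ValueError("offset must fall inside the 2KB window")
    return validate_base(base) + offset


# Hypothetical values chosen only for the example.
print(hex(register_address(0x2014_0000, 0x400)))
```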

Initialization 

To make the SSC especially easy to use, most of 
the SSC configuration bits are grouped in a single 
register. These bits include setup for the UARTs, 
programmable address strobes, ROM packing, 
and halt-protection features. Thus, during system 
initialization, most SSC features can be config- 
ured with a single write. 

MicroVAX and Multi-speed 
Compatibility 

Although targeted primarily as a companion to 
the CMOS VAX CPU, the SSC is also compatible 
with the older NMOS MicroVAX CPU used in the
MicroVAX II. Thus, new low-cost or low-performance
designs using the older microprocessor
chip can also take advantage of the high integra- 
tion and extra functionality provided by the SSC. 

The SSC is also compatible with modules that 
have either high or low cycle times. Originally 
designed for a 100-ns microcycle, the CVAX 
microprocessor runs at 90 ns in the MicroVAX 
3000 system and at 80 ns in the VAX 6200 
system. Early in the development of the CVAX 
chip set, we decided that chips that were not 
performance-critical, like the SSC, would run at 
just one speed (100 ns), but would be capable of
interfacing to a faster-running microprocessor. 
Speed conformability would simplify develop- 
ment, manufacturing, and field support because 
one SSC could be used across all MicroVAX 
systems. 

Accordingly, the SSC bus interface, running at a 
100-ns microcycle, accommodates microprocessors
running at microcycles from 100 ns to 60 ns.

Summary 

The SSC project yielded a CVAX microprocessor 
companion chip that provides a high degree of 
functionality, flexibility, and integration. Com- 
prising console support, timers, decoders, and 
other programmable features on a single chip, 
the SSC permits system designers to develop 
smaller, more integrated modules at lower cost. 
Moreover, improvements made to the generalized 
features, such as halt protection and break detec- 
tion, contribute to increased system reliability 
without reducing system design flexibility. 

The utility of the SSC is evidenced by plans to 
include the chip in over a dozen different Digital 
products, such as the MicroVAX 3000 systems, 
the VAX 6200 systems, many XMI adapter 
boards, and various controller products.

Development of the
CVAX Q22-bus Interface Chip

The CVAX Q22-bus interface chip (CQBIC) is a highly integrated, single
chip that serves as the interface between the CVAX microprocessor and the 
Q22-bus I/O subsystem. The CQBIC VLSI design is the first produced by 
Digital's Japan Research and Development Center in coordination with 
teams in the U.S. Before implementing the interface design, team members 
built a test chip to ensure the feasibility of a CMOS Q22-bus transceiver and 
to test various design alternatives. Also as part of their research effort, 
they examined alternative designs for several functions, including the 
scatter-gather map cache and the data buffering functions. Project
designers then implemented the CQBIC using a mix of full custom and 
semicustom design databases. A description of the five major functional 
sections is presented in this paper. 



The CVAX Q22-bus Interface Chip (CQBIC) is an 
evolutionary step in functionality and integration 
from the MicroVAX II CPU module's Q22-bus 
interface design. The MicroVAX II CPU module's
Q22-bus interface comprises 18 discrete chips
and a gate array; the module design employs
linked sequential controllers.¹ The advanced
CQBIC design integrates these controllers and all 
other interface functionality in a single chip and 
retains the linked controller design. 

Specifically, the CQBIC provides the electrical 
and functional interface between the 32-bit CVAX 
microprocessor and the 16-bit Q22-bus I/O sub-
system. Integrated on the chip are the complete
Q22-bus interface, data buffering, the CVAX
bus² direct memory access (DMA) interface, a
scatter-gather (S/G) map cache, and complex
control logic. Table 1 lists the chip's physical
characteristics. 

Begun in February 1985, the two-year CQBIC 
project was a joint venture for three of Digital's 
groups: Japan Research and Development Center, 
Large Scale Integration (JRDC/LSI); Semi-
conductor Engineering Advanced Peripherals
Development (SEG/APD); and Micro Systems
Development (MSD).³

Project Goals and Organization 

Table 1 CQBIC Physical Characteristics

Process                 2-micron drawn, N-well,
                        dual aluminum CMOS
Number of transistors   40,900 (approx.)
Die size                9.2 mm x 9.4 mm
Power consumption       1.5 W
Packaging               132-pin surface-mountable
                        chip carrier with 25-mil
                        lead spacing and heat sink
Power supply            +5 V

A highly integrated, single-chip, CVAX bus to
Q22-bus adapter was a desirable product for sev-
eral reasons. Primarily, such a chip would reduce
component costs and system module size, and
increase system reliability as compared with the
MicroVAX II CPU module's Q22-bus interface.

Therefore, the primary goal of the CQBIC pro- 
ject was to develop a highly integrated chip as an 
interface between the CVAX microprocessor and 
the Q22-bus. This chip would ease the task of 
Digital's system designers by standardizing the 
interface to the Q22-bus and by providing the
same or improved I/O bandwidth performance as
the MicroVAX II CPU module Q22-bus interface.

Achievement of this performance goal was 
complicated by the single-port memory architec- 
ture of the first planned CPU module and its two- 
level instruction and data, direct-mapped cache 
scheme. In comparison, the MicroVAX II CPU
module has a dual-ported memory architecture
with no caching. However, the single-port DMA
architecture was required for the new two-level
cache architecture; with a single-port organiza-
tion, DMA addresses can be viewed by the caches
so that the caches can invalidate valid entries dur- 
ing I/O-to-memory write transactions. Conse- 
quently, to both accommodate this architecture 
and meet its performance goals, CQBIC had to 
be designed to consume little CVAX bus band- 
width while performing DMA transactions. Such 
a design would not greatly degrade CVAX 
microprocessor performance. 

A second important project goal was to pre- 
serve I/O performance and operating system 
software compatibility.⁴ Therefore, CQBIC
would provide the same Q22-bus virtual to CPU 
physical memory address translation as contained 
on the MicroVAX II CPU.

In addition to meeting these goals, the CQBIC 
project would also serve to demonstrate the feasi- 
bility of a remote VLSI design center for the SEG
organization. Moreover, through this project the 
JRDC/LSI Group would have an opportunity to 
demonstrate its VLSI design capabilities. 

Further complicating the challenges presented 
by the design goals, the distance between the 
working groups, the cultural and work style 
differences, and the language barrier was the 
newness of the JRDC team. Many of the JRDC 
team members could read and write English, but 
had some difficulty speaking and listening to 
English. Also, the Japanese language was com- 
pletely foreign to MSD and SEG. Written English
served as the primary form of communication 
throughout the project. Further, the JRDC team 
members had to learn not only about Digital's 
products and architectures, but also the Q22-bus, 
the other five chip specifications under develop- 
ment, the SEG semicustom and custom chip
design tool suites, and Digital's CMOS technol- 
ogy. To help with this steep learning curve, 
experts from each of these areas facilitated the 
training and information flow. These experts
provided answers to specific questions and
helped to solve specific problems as follow-up to
formal training sessions.

Based on the MicroVAX II CPU design experi- 
ence in SEG and MSD, SEG provided leadership for 
both the chip specification development and the 
project. This role involved conveying to the JRDC
team the chip functional definition and detailed
behavior specifications. This information had to
be presented in the context of the five other VLSI
chips being designed by the SEG groups with a
focus on the CPU module product. The U.S.-
based project leadership had to provide budget,
schedule, and task coordination for JRDC, MSD, 
and for other organizations within SEG. 

As the initial customer, MSD performed three 
major specification reviews. This group continu- 
ally provided direction concerning design 
tradeoffs, and requested specific functionality 
revisions to tailor CQBIC more to their CPU 
application. 

Digital's Engineering Network was the primary 
means of transferring written communications 
between groups. We also exchanged information 
by sending facsimile copy and by mailing mag- 
netic tapes and documents. At times, telephone
discussions and personal visits were necessary. 

Specification development began with a two- 
week visit to the JRDC facility in Tokyo. At that 
time, we wrote the first draft with key members 
of the JRDC team. This draft specification laid 
the foundation for subsequent architecture and 
functionality research, and served as a communi- 
cation medium. The draft specification was then 
maintained by the JRDC team and SEG and was 
frequently revised and reviewed. 

The following section presents the project 
research conducted to ensure the feasibility of 
project goals and to resolve major questions 
raised by the draft specification. 

Project Research 

Project research focused on two areas. First, we 
wanted to evaluate the risks involved in the 
implementation of a CMOS Q22-bus transceiver. 
For this purpose, SEG team members implemen- 
ted a test chip. Second, we wanted to determine 
the best means to achieve our stated performance 
goals. The tests and studies which we conducted 
and their results are described below. 

Q22-bus Transceiver Test Chip
To determine whether or not a CMOS Q22-bus
transceiver could be implemented, several stud-
ies were performed by SEG circuit designers
responsible for the cell library. These studies
showed feasibility, with two major implementa- 
tion risks: 

■ The proposed differential comparator to be 
used as the receiver required a stable voltage 
reference.

■ The 100-milliampere (mA) peak, 70-mA
steady-state sink current Q22-bus transceivers
were to be on the same substrate as complex
control circuitry. Three problems could
result:

- CMOS latch-up due to charge injection from
input signal overshoot

- Excessive noise due to substrate current
transients

- Excessive localized power dissipation

With several design alternatives available to us, 
we needed more experimental data to determine 
the better alternatives. To obtain this data, a 
Q22-bus octal transceiver test chip was designed, 
fabricated, and packaged by SEG circuit design- 
ers. Available after seven months, this packaged 
octal transceiver test chip was tested in a 
MicroVAX II CPU module and performed well 
under system conditions.

The test chip experiments showed that CMOS 
latch-up due to worst-case overshoots below 
ground did not occur. These results matched our 
expectations. We were not concerned with over- 
shoots above the +5 volts (V) bias because of the
Q22-bus termination voltage of 3.4 V. Tests also
showed that special care would be required in
the allocation of dedicated ground pins for the
Q22-bus transceivers to avoid noise coupling
from substrate bounce and package power-lead
inductance. Also, in the chip layout, we would
have to use many parallel traces of metal inter-
connect to prevent metal migration when sinking
100 mA of peak current. Finally, due to low
channel resistance of the Q22-bus driver output
pull-down device, the power dissipation of the
test chip was shown to be within reliable opera- 
tion limits. Therefore, CQBIC power dissipation 
was not a concern in terms of thermal characteris- 
tics of the planned packaging. 

The test chip results did lead to a compromise 
concerning the stable voltage reference. Because 
of large variations in CMOS process materials, a 
precision off-chip or external resistor would bet- 
ter serve to establish the required voltage than 
would some risky process-desensitized structure 
in CMOS.

Prior to these tests, we designed CQBIC to 
facilitate the use of either integral transceivers or 
off-chip transceivers. Fortunately, the test data 
demonstrated the feasibility of a single chip with 
integral Q22-bus transceivers, and the project
proceeded under a plan that included integral
transceivers.

Architecture and Performance Studies 
As the octal transceiver test chip was being devel-
oped, MSD, JRDC, and SEG conducted architec-
ture and performance studies. These studies
would answer questions about the organization of
the S/G mapping function, the data buffering
required to meet the performance goals, and the
partitioning and clocking of the sequential con-
trollers that manage the two asynchronous buses
and the internal functions.

S/G Mapping 

A RAM structure was first proposed to implement 
the S/G mapping functionality. The MicroVAX II 
CPU design had used such a structure, with two 
8K-by-8 static RAMs. This proposal, however, was 
rejected since not all of the RAM would fit on a 
single chip with all the other required circuitry. 
Increasing the chip size was not an option. The 
chip size was limited for cost reasons as well as 
packaging cavity size reasons. The chip's cost is 
directly proportionate to its size, and the design 
of a new package was outside the scope of the 
project. Moreover, implementation of a portion 
of the RAM would have introduced a system soft- 
ware incompatibility with MicroVAX II and 
would have reduced the planned performance. 

As the problem of S/G mapping functionality
was studied, it became clear that system memory
was adequate to hold the map. Further, CQBIC
could not implement the full 8192-entry RAM on
a chip size that could be fabricated with reason-
able yield. Also, a
capability to prefetch S/G map entries based on 
expectation was considered necessary to sustain 
peak, as opposed to average, performance. We 
looked to the Q22-bus DMA devices which per- 
form transactions with incrementing addresses. 
In particular, Q22-bus devices are designed to 
utilize the Q22-bus block-mode data transfer 
protocol. This protocol transfers data packets of 
eight-word blocks. With this protocol available, 
we could design the CQBIC to cache the S/G map 
entries from system memory on demand and on 
expectation. 

The next two problems were how to imple- 
ment the cache and how many entries to include 
in the cache. A 16-entry cache provided the bal- 
ance we sought between several factors: appro- 
priate chip area, implementation complexity, 
design risk, and DMA I/O performance impact.

Data Buffering

CVAX bus bandwidth was targeted to be four or
more times greater than typical Q22-bus band-
width. Also, the CVAX bus was being designed to
support DMA multidata transfers. This design was
consistent with the Q22-bus block-mode data
transfer protocol. To bridge the bandwidth gap
between the two buses and to minimize the use
of CVAX bus bandwidth, data buffering tech-
niques were investigated to optimize Q22-bus
block-mode throughput for read and write trans-
actions. These investigations resulted not only in
the determination of buffer sizes but also in a
decision on how to control the buffers to opti-
mize sustained throughput and minimize initial
latency.

The MicroVAX II CPU is capable of supplying
read data to the Q22-bus with a very consistent
access time because memory arbitration is not
required. To achieve MicroVAX II average read
performance, read data prefetching was consid-
ered necessary to compensate for the memory
arbitration time. For CQBIC, the first read of a
Q22-bus transaction would be delayed by
the DMA request and grant time, to obtain master-
ship of the CVAX bus, and by the subsequent sys-
tem memory access time. The delay would always
be longer than MicroVAX II read latency, which
had only memory access time to
consider. We determined that two quadword read
buffers would be sufficient to sustain the
required throughput because read data is
prefetched based on expectations of the Q22-bus
block-mode protocol. Low latency was achieved
by providing a response to the Q22-bus as the
first longword of the quadword read data was
obtained from system memory.

Pipelining the buffered write data could be
achieved with two buffers, each eight words
deep. An octaword block is the packet size of the
Q22-bus block-mode protocol and is the maxi-
mum multitransfer block size of the CVAX bus.
The control logic would be designed to allow one
buffer to be unloaded to system memory while
the other was being filled. The latency would be
better than that of the MicroVAX II CPU module,
since the CQBIC data was packed into fast octa-
word buffers. The average throughput would be
sustained by the four times or greater bandwidth
of the CVAX bus, as compared to the Q22-bus, and
by the use of pipelined data buffers.

The CQBIC buffering and transaction optimiza-
tions in conjunction with the CVAX CPU internal
cache hit rate result in an insignificant DMA I/O
impact on CVAX CPU performance. Given the
buffering and control organization and optimiza-
tions described above, the performance difference
between the single-port and the dual-port mem-
ory designs cannot be detected by a Q22-bus
device. The result is improvement in Q22-bus
read and write throughput over the MicroVAX II
CPU. The CQBIC maximizes Q22-bus perfor-
mance and minimizes CVAX bus usage. Moreover,
CQBIC can sustain Q22-bus block-mode transfer
write data rates of 3.3 megabytes (MB) per sec-
ond and read data rates of 2.5 MB per second.

Finally, to optimize the CVAX I/O write perfor-
mance, a dump-and-run buffer was to be imple-
mented in CQBIC. This buffer is used to avoid
tying up the CVAX bus while the slower Q22-bus
transaction completes and while deadlock situa-
tions are resolved.

Controller Partition 

Given these buffering functions, the control of 
the data path and of the two major bus interfaces 
was naturally partitioned into five linked con- 
trollers and a prioritization function. Each bus 
interface was partitioned into a master and a slave 
controller. The S/G map cache also required a 
controller. Then, to assist in coordination of con-
trol flow decisions, a priority resolver function
was needed.

This partition allows the Q22-bus and the 
CVAX bus to operate in parallel while all dead- 
lock conditions are resolved. Fortunately, the
CVAX chip team implemented a bus transaction
retry capability. This retry capability proved
essential to our partition and implementation of 
CQBIC control functionality. 

Clocking 

Two primary factors led us to select a 50-nano-
second (ns) two-phase nonoverlapped internal
clock scheme. First, the MicroVAX II CPU mod-
ule's 50-ns single-phase clocking scheme was
a proven approach and mapped well to the fixed
Q22-bus minimum asynchronous timing specifi-
cations. Second, we expected synchronous CVAX
bus cycle timing to vary with CMOS technology
improvements. The variable CVAX cycle time and
four-phase overlapped clocking scheme could
not be used to generate the fixed Q22-bus tim-
ing. Also, having two clocking schemes in one
chip was determined to be a design too complex
to manage.

The implication of the selected CQBIC clock-
ing scheme was that, with reference to all inter-
nal controllers, the CVAX bus and the Q22-bus
were asynchronous.

Research Results Summary 
The result of the research was a single chip 
design that would achieve the stated project goals 
by providing 

■ Integral Q22-bus transceivers 

■ A 16-entry map cache, with prefetching 

■ Two octaword Q22-bus write buffers

■ Two quadword Q22-bus read buffers, with
prefetching

■ A longword CVAX write buffer

■ Transaction partitioned sequential controllers, 
which are optimized for look-ahead data 
buffering control and for utilization of multi- 
ple-transfer transactions to minimize CVAX 
bus and Q22-bus usage 

The research results were documented in the
form of a revised chip specification and a behav-
ioral model. The chip was implemented from the
revised specification with a process that was
unique and unproven.

Implementation Process 

CQBIC was implemented using a mix of standard
library cells, custom library cells, and full cus-
tom layout sections. At the time, SEG could not
offer a formal design tool suite to deal with such
a mix of full custom and semicustom design data-
bases. So the JRDC team standardized by select-
ing the methods of the semicustom tool suite for
logic and circuit design. The semicustom sche-
matic editor and wire lister were used to design
all the logic. This wire lister facilitated interfac-
ing to SPICE and other checking tools and, most
importantly, to the layout tools. For layout, no
automation of floor planning and cell placement
and routing could be employed. This layout was
all done by hand, as were the full custom designs.
Interconnect verification and design rule check-
ing were completed using the tools from the cus-
tom design suite.

A full custom layout section was required to
implement the S/G map cache because of the
chip-size and latency constraints. A part of the
latency is due to the Q22-bus address look-up in
the cache. The S/G latency had to be small to com-
pensate for the long latency that could occur, for
example, when the look-up misses the cache and
requires an S/G map memory read access.

The standard cell library was rejected because 
it did not offer a content addressable memory 
(CAM), which is the structure required to facili- 
tate fast address look-ups. In addition, the use of 
standard library cell latches and exclusive OR 
gates was estimated to almost double the desired 
look-up time on the 16 cached entries.

Again to contain chip size and also to meet con- 
trol performance, custom programmable logic 
array (PLA) sections were required. The PLA
structures offered by the standard cell library
were too slow and required a clocking scheme
different from the CQBIC two-phase clocking
scheme. This decision to implement custom PLA
structures is credited as the reason performance
goals were achieved. In fact, performance goals
could not have been achieved without custom
PLA structures.

At the time logic and circuit design began, the 
standard library cells available for this design 
were found to be inadequate. Many necessary 
functions were missing or were not tailored for 
the specific application. Also, in many cases the 
performance of library cells did not match the 
performance required by the two-phase clocking 
scheme. Hence the JRDC team developed its own 
extensions to the standard cell library. The com-
mon logic structures such as NAND, NOR, flip-
flop, and latch were used from the standard cell
library as much as possible, since these struc-
tures reduced the risk of circuit problems. Cus-
tom structures, such as counters, multiplexers,
latched pad transceivers, synchronizers, PLA AND
plane drivers, and PLA OR plane receivers, were
designed and made available to the library. 

The JRDC team accurately modeled the chip
based on the specification at the behavioral and
the MOS levels of abstraction using Digital's
DECSIM simulator.

Initially, the JRDC team developed a behav- 
ioral system environment model based on their 
understanding of the CVAX bus and the Q22-bus 
specification. This environment model was 
layered around the CQBIC behavioral model to 
verify the design. As the design progressed, a 
more accurate behavioral chip model replaced 
the initial model after correlation. 

Further, as other CVAX behavioral, structural, 
or MOS chip models matured, MSD incorporated 
them into the CPU system model. This model was 
then used to test the CQBIC further in the con-
text of the application system. System simulation
proved that all CVAX bus specifications which 
were communicated were understood and imple- 
mented correctly. The system simulation served 
as an independent test of the CQBIC design. 
Although no CQBIC problems were found by 
MSD during system simulation, the testing did 
prove that the system would operate. We later
learned that several bugs could have been found 
had more time-varied events been scheduled 
with the system simulation test cases. 

When completed, the CQBIC MOS model was 
correlated to the behavioral chip model. The 
MOS chip model was then placed in the MSD sys- 
tem model for regression testing. 

When we were confident that the CQBIC 
design was complete, that is, when no new bugs 
were found after thorough testing, the chip 
was released to SEG for a final design review and
submittal for fabrication. The database was 
copied over the Engineering Network from the 
JRDC facility in Tokyo to the Hudson, Massachu- 
setts, plant. After completing a final design 
review and subsequent problem fixes, the chip 
was submitted for fabrication. Eight weeks later 
first-pass parts were probed and found to be func- 
tional. Packaged parts were run in the MSD CPU 
module. This testing revealed several timing bugs 
related to events from both buses occurring at 
the same time. After extensive testing, the bugs 
were fixed, and a second revision was released for 
fabrication. When the second pass part was tested 
in the CPU module, another timing problem 
related to coincident transactions from both 
interfaces surfaced. This particular bug was 
obfuscated by a pass 2 bug. A third revision was 
prepared and fabricated. This third pass was 
available in time for the first customer shipments.
The final chip functionality is briefly described
below.

The CQBIC Functional Organization 

CQBIC is an asynchronous CVAX bus device and 
requires a fixed 40-megahertz oscillator input to 
derive Q22-bus timing. The oscillator input is 
used to generate a two-phase, nonoverlapped 
clock which is distributed to all chip sections. 
The CVAX bus interface was designed to accom- 
modate transaction cycle times from 100 ns to 
60 ns. This design anticipated a CVAX CPU tech- 
nology change and subsequent performance 
improvement. 
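
The relationship between the oscillator input and the internal clock can be checked with a little arithmetic. The sketch below assumes that the 50-ns internal cycle is obtained by dividing the 40-megahertz oscillator by two; the text does not state the divider, and the function and constant names are illustrative only.

```python
OSCILLATOR_HZ = 40_000_000   # fixed oscillator input, from the text
ASSUMED_DIVIDER = 2          # assumption: two oscillator periods per cycle


def internal_cycle_ns(oscillator_hz: int = OSCILLATOR_HZ,
                      divider: int = ASSUMED_DIVIDER) -> float:
    """Return the internal clock cycle time in nanoseconds."""
    oscillator_period_ns = 1e9 / oscillator_hz
    return divider * oscillator_period_ns
```

In this model the internal cycle depends only on the oscillator, not on the CVAX bus cycle time, which may range from 100 ns to 60 ns.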



CQBIC provides the power-up, initialization, 
power-fail, and power-down protocols to the 
system and performs Q22-bus and CVAX bus 
address decoding. Further, the chip performs 
the page address S/G mapping function for DMA 
devices by using its 16-entry S/G address map 
cache. 

This cache contains a copy of the most recently 
used S/G pointers, which are located in system 
memory. The cached pointers are used to map 
22-bit Q22-bus virtual to 29-bit CVAX bus physi- 
cal addresses. CVAX bus and Q22-bus transac- 
tions are optimized by using a CPU dump-and-run 
write buffer, two pipelined Q22-bus octaword 
write buffers, and two pipelined Q22-bus quad- 
word read buffers. The chip performs transparent 
address and data alignments, and packing and 
unpacking of internal buffers. 

CQBIC is composed of five global control sec- 
tions. A block diagram of the chip control sec-
tions is shown in Figure 1.

Each section contains an independent sequen- 
tial controller: 

■ The Q22-bus arbiter 

■ The S/G map 

■ The Q22-bus master 

■ The Q22-bus slave and CVAX bus master 

■ The Q22-bus electrical interface. 

A photomicrograph showing the floor plan of 
the control sections is shown in Figure 2. 

Each section shown in the Figure 1 block 
diagram is described next. 

Q22-bus Arbiter Section
As a Q22-bus arbiter, the CQBIC is the default 
Q22-bus master and the highest priority 
requester. The arbiter accepts requests from 
Q22-bus DMA devices and from the master sec- 
tion, and grants mastership with first priority to 
the master section. In response to a master 
request, the arbiter exercises a demand master- 
ship protocol to Q22-bus devices to ensure 
low-latency interrupt vector or data reads. In 
response to interrupt requests from the 
Q22-bus, the arbiter receives the requests and 
passes them to the CPU. When the CPU acknowl- 
edges the request, CQBIC reads a vector from 
the Q22-bus device and supplies an acknowledge 
signal.

Figure 1 Control Section Block Diagram



When multiple CQBIC chips are connected to
the Q22-bus, they take on different functions. 
The first chip operates as Q22-bus arbiter; the 
others operate in auxiliary mode. As an auxiliary 
mode device, a CQBIC chip does not perform 
Q22-bus arbitration. Instead, the chip behaves as 
a typical Q22-bus DMA device that is a default 
Q22-bus slave. Therefore, when the CPU initiates 
a Q22-bus transaction, its CQBIC requests Q22-bus
mastership. The arbiter CQBIC serves as Q22-bus
arbiter and grants the bus accordingly to auxil-
iary mode CQBICs and other DMA devices.



Either as arbiter or as an auxiliary device, the 
arbiter function performs the system power- 
up, initialization, power-fail, and power-down 
sequences. 

S/G Map Section 

Figure 2 Photomicrograph of CQBIC

The S/G map consists of 8,192 longwords allo-
cated from system memory. Each map entry con-
sists of a 20-bit page pointer, a 3-bit descriptor
which CQBIC ignores, and a valid bit. The low
9 bits of a Q22-bus address pass through as an
interpage offset; the upper 13 bits select the con-
tents of one of the 8,192 S/G map locations. The
CPU informs the CQBIC of the S/G map location
by writing a base address into the CQBIC map
base register. This write flushes the valid bits of
the cached map entries.
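
The address arithmetic described above can be sketched in Python. The field widths (9-bit offset, 13-bit index, 20-bit page pointer) come from the text; the positions of the page pointer and the valid bit within a map entry are assumptions made for illustration, as are all names in the code.

```python
from typing import Sequence

PAGE_OFFSET_BITS = 9                  # low 9 bits pass through as the offset
MAP_ENTRIES = 1 << 13                 # upper 13 bits select one of 8,192 entries
PAGE_POINTER_MASK = (1 << 20) - 1     # assumed: pointer held in the low 20 bits
VALID_BIT = 1 << 31                   # assumed: valid flag held in the top bit


def translate(q22_address: int, sg_map: Sequence[int]) -> int:
    """Map a 22-bit Q22-bus address to a 29-bit physical address."""
    if not 0 <= q22_address < (1 << 22):
        raise ValueError("Q22-bus addresses are 22 bits wide")
    index = q22_address >> PAGE_OFFSET_BITS
    offset = q22_address & ((1 << PAGE_OFFSET_BITS) - 1)
    entry = sg_map[index]
    if not entry & VALID_BIT:
        raise LookupError(f"S/G map entry {index} is not valid")
    page = entry & PAGE_POINTER_MASK
    # 20 pointer bits plus 9 offset bits give the 29-bit physical address.
    return (page << PAGE_OFFSET_BITS) | offset
```

The 3-bit descriptor is simply masked off here, which mirrors the statement that CQBIC ignores it.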

To avoid map cache coherency problems, the 
CPU accesses the S/G map through a VAX I/O 
address range decoded by the CQBIC master sec- 
tion. The slave section then performs the S/G 
map memory transaction. This indirect approach 
prevents the CPU from directly modifying the 
S/G map memory independent of the 16 cached 
pointers. A CPU to S/G map write invalidates the 
cached map entry as the slave section performs 
the memory write. CPU to S/G map reads return 
the cached copy if it was cached or return the 
S/G pointer from system memory. 



As noted in the section Project Research, we 
selected a map cache size of 16 entries. The re- 
search of Q22-bus DMA device transfer sizes and 
the number of devices active in a dynamic system 
showed that 16 entries were sufficient to avoid 
thrashing on entries. The effects of the Q22-bus 
fair arbitration scheme were used to show that 
the simple first-in, first-out (FIFO) replacement
algorithm selected did not waste performance 
and was consistent with incrementing DMA 
device addresses. As a DMA device transfer 
address incremented to a page boundary, the next 
map entry would be prefetched, and the previous 
map entry was not used unless the current I/O 
request completed and another was requested. 
We found that the operating system allocated
map entries for I/O requests to Q22-bus DMA
devices from a free pool list maintained in a last-
deallocated, first-allocated manner. The overhead
of one extra read for a map entry per page was
found to be insignificant.
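
A software model may make the cache policy easier to follow. The sketch below is not the CQBIC logic; it is a minimal Python illustration of a 16-entry cache with FIFO replacement, a prefetch of the next map entry near a page boundary, invalidation on a CPU write to the map, and a flush on a write to the map base register. The class and method names, the `read_map_entry` callback, and the exact prefetch trigger are assumptions, and valid-bit checking is omitted.

```python
from collections import OrderedDict
from typing import Callable

CACHE_ENTRIES = 16
PAGE_OFFSET_BITS = 9
LAST_INDEX = (1 << 13) - 1


class MapCache:
    """16-entry S/G pointer cache with first-in, first-out replacement."""

    def __init__(self, read_map_entry: Callable[[int], int]) -> None:
        self._read = read_map_entry      # fetches one entry from system memory
        self._entries: "OrderedDict[int, int]" = OrderedDict()
        self.memory_reads = 0

    def _fill(self, index: int) -> int:
        if index not in self._entries:
            if len(self._entries) == CACHE_ENTRIES:
                self._entries.popitem(last=False)   # evict the oldest entry
            self._entries[index] = self._read(index)
            self.memory_reads += 1
        return self._entries[index]      # hits do not change the FIFO order

    def lookup(self, q22_address: int) -> int:
        """Return the pointer for an address, prefetching at the end of a page."""
        index = q22_address >> PAGE_OFFSET_BITS
        pointer = self._fill(index)
        # Hypothetical trigger: the transfer has reached the last word of the page.
        at_page_end = (q22_address & 0x1FF) >= 0x1FE
        if at_page_end and index < LAST_INDEX:
            self._fill(index + 1)
        return pointer

    def invalidate(self, index: int) -> None:
        """Drop one entry, as on a CPU write to that S/G map location."""
        self._entries.pop(index, None)

    def flush(self) -> None:
        """Drop every entry, as on a write to the map base register."""
        self._entries.clear()
```

Because a hit leaves the insertion order untouched, the oldest entry is always the one replaced, which is what distinguishes FIFO from a least-recently-used policy.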

Q22-bus Master Section 
The master section contains two configuration 
registers and three status-and-error reporting reg- 
isters in addition to all the control circuitry. 

The master section's function is to decode all 
the CVAX bus addresses and cycle status codes. 
This decoding determines which of two types of 
actions is required: 

■ A transaction to an internal register, the S/G 
map, or the Q22-bus 

■ Q22-bus mastership prior to completion of the 
transaction 

Each of these actions is described in the text 
below. 

If a decoded address requires no CQBIC 
response, a signal pin is asserted to external logic 
for control of buffers and timeout counters. 

Transaction to an Internal Register 
When the master section detects a CVAX bus 
address for one of the two control or three 
address registers in CQBIC, it returns or writes 
the data. The master section also facilitates a 
memory lock for the CPU to perform a read-lock 
and write-unlock operation. First, the master
detects a CVAX bus interlocked transaction and
then performs a retry until Q22-bus mastership is
obtained. Q22-bus mastership is held until an
unlock transaction or an exception occurs. As 
long as other Q22-bus devices follow this proto- 
col, memory that is mapped to the Q22-bus can 
be shared. 

Transaction to S/G Map 

As noted in the S/G Map section, S/G map trans- 
actions are controlled by the master section. The 
master requests the slave and map cache sections 
to complete the memory and cache transactions. 
To construct the memory address for the slave and 
map cache, the master uses the significant low 
13 bits of the S/G map I/O address as an offset from
the map base register. 

Transaction to Q22-bus

To avoid deadlocks, the master utilizes the CVAX 
CPU retry transaction. (The CVAX CPU relinquishes
CVAX bus control to the CQBIC slave section. The
CPU then retries the same transaction when bus
control is returned.) S/G map transactions have a
higher priority than Q22-bus slave transactions.
The slave section therefore performs S/G map
transactions in parallel with Q22-bus slave trans- 
actions. When the master tries to access the 
Q22-bus and it is busy, the arbiter attempts to 
gain mastership. Until mastership is obtained, 
the slave can perform a retry to satisfy the 
Q22-bus transactions. 

Q22-bus Mastership 

When the master acquires Q22-bus mastership, it
sequences the transaction. A special case of the 
sequence occurs when the I/O memory segment 
address maps back to system memory through the 
slave and map cache. In this case a retry is used, 
and the slave gives the data to the master. 

CPU writes to the Q22-bus are accepted
by the master in a dump-and-run manner to
improve performance.

Q22-bus Slave Section 
The slave section design implemented the two
quadword read buffers and the two octaword
write buffers. This section was the key to
realizing the performance goals established for
the chip. The slave has to respond to all Q22-bus
transactions by checking the address in the S/G 
map and then sequencing the CVAX bus to put or 
get data. The slave must coordinate its intentions 
with all other chip sections to avoid deadlock 
conditions. This coordination is realized in a pri- 
oritization circuit which receives state inputs 
from all sections of the chip and outputs status 
codes to the slave and master sections to trigger 
actions. 

The slave watches for master or Q22-bus trans- 
action requests. When the slave receives Q22-bus 
addresses, it passes these to the map cache for 
validation. If the S/G entry is not cached, the map
cache signals the slave to acquire a new S/G map
pointer from system memory. The map cache
will cache this new entry if the valid bit is set. If
the valid bit is cleared, then an exception is
taken. When the address is validated, the slave
proceeds to sequence the transaction to or from a
buffer and system memory. During slave writes to
the system memory, the CVAX is signaled to
invalidate its internal cache.

The slave maintains two octaword write buffers
to optimize Q22-bus octaword block-mode
transactions. By using a CVAX bus multitransfer burst,
the slave can unload one buffer to memory while
filling the other octaword buffer.
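
The alternating use of the two write buffers can be pictured with a small software model. It is a sketch under stated assumptions, not the chip's logic: the class and attribute names are hypothetical, an octaword is taken as eight 16-bit Q22-bus words, and the unload that hardware performs concurrently is modeled as an immediate step.

```python
WORDS_PER_OCTAWORD = 8  # 16-byte octaword / 2-byte Q22-bus word


class PingPongWriteBuffers:
    """Toy model of two write buffers that are filled alternately."""

    def __init__(self):
        self.buffers = [[], []]
        self.fill = 0        # index of the buffer currently being filled
        self.memory = []     # each entry stands in for one burst to memory

    def write_word(self, word):
        """Accept one 16-bit word from the Q22-bus side."""
        current = self.buffers[self.fill]
        current.append(word & 0xFFFF)
        if len(current) == WORDS_PER_OCTAWORD:
            full = self.fill
            self.fill ^= 1           # later words go to the other buffer
            self._unload(full)       # hardware would overlap this with filling

    def _unload(self, index):
        """Move a full buffer to memory as one octaword."""
        self.memory.append(tuple(self.buffers[index]))
        self.buffers[index] = []


bufs = PingPongWriteBuffers()
for w in range(20):
    bufs.write_word(w)
```

After 20 words, two octawords have been unloaded and four words are waiting in the buffer that is being filled.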

For each new Q22-bus read request, the slave 
prefetches four words from memory. This pre- 
fetch is done in anticipation of block-mode trans- 
actions. These four words are buffered and sent to 
the Q22-bus master. As the third word is 
unloaded, the slave prefetches four more words. 

As either a Q22-bus block-mode read or write 
transaction nears a page address boundary, the 
slave performs an S/G map entry prefetch of the 
next entry. The slave then passes the prefetched 
entry to the map cache. 

An additional function of the slave section is a 
Q22-bus addressable interprocessor doorbell 
register. This register accommodates arbiter and 
auxiliary mode operation by supplying to the 
CPU a memory access semaphore, an interrupt 
request, and a vector address. 

Q22-bus Electrical Interface Section
The Q22-bus is a 120-ohm transmission line
with near-end and far-end parallel termination. The
length of the Q22-bus can vary from 25 to 60 cen- 
timeters and is subject to reflection and crosstalk 
noise. CQBIC contains 33 transceivers and 
9 receivers which connect directly to the 
Q22-bus. 

The open-drain outputs and filtered inputs 
were designed to operate reliably in the 
Q22-bus environment. 

The input filter rejects crosstalk and reflection
noise by staging a low-pass RC filter. The filter is
constructed with an n-diffusion resistor and a
p-type field-effect transistor (PFET) capacitor,
with a differential amplifier receiver that
maintains a narrow noise immunity region.

The open-drain output driver controls the edge
rates. This control minimizes transmission line
reflections and crosstalk for ac load variation
from 30 to 330 picofarads and dc termination
variation from 240 to 60 ohms at 3.6 volts. To satisfy
the 100 mA sink current possible on each of the
33 outputs without excessive heating, low
internal power dissipation was achieved by a low
steady-state "on" resistance.

A disable control allows the output to power 
down without affecting the Q22-bus. 

Conclusion 

A single-chip Q22-bus interface was realized and
is being shipped in Digital's systems as the result
of the successful venture for JRDC, SEG, and MSD.
We learned how to manage efforts from a distance
and to coordinate and communicate complex
technical information around the globe.

The CVAX CMCTL: A CMOS Memory Controller Chip

The CMCTL, part of the CVAX family of chips, is a high-performance
ECC memory controller for single-processor systems. Implemented in
Digital's CMOS technology, the CMCTL is optimized to satisfy Q-bus-based
system requirements. The CMCTL operates as either a synchronous or an
asynchronous interface between the CVAX bus at cycles from 60 to 100
nanoseconds and the private memory interconnect. For memory read or
write operations, the CMCTL supports the CVAX multiple-transfer
protocol. Data parity and memory error checking are implemented for all data
transfers. The chip's high performance is achieved in part by a high-speed,
page-mode access protocol.



The decision to design a CVAX memory controller 
(CMCTL) was made in July 1984. The primary 
goal of the CVAX CMCTL project was to design a 
high-performance, single-chip, error-correcting 
code (ECC) memory controller for a
single-processor system. This chip would be part of a
CVAX family of core peripheral functions.

Several systems being developed at that time
utilized the MicroVAX II CPU chip, the
predecessor to the CVAX CPU chip. Because company
revenue for Q-bus-based systems such as the
MicroVAX II is significant and a performance
benefit could be gained from a custom chip
design, the memory controller design goals were
focused to satisfy the requirements of a
Q-bus-based system. The initial system requirements for
the CMCTL were determined by studying the 
memory controller specifications and by dis- 
cussing requirements with key members of the 
project team for the existing MicroVAX II system. 
In addition, the Electronic Storage Development 
(ESD) Group was consulted on the requirements 
of a memory controller. 

Let us now examine the key aspects of the
CVAX CPU chip that influenced the system
requirements for the CMCTL. First, the CMCTL
had to interface directly to the CVAX bus and
handle the memory transactions originating from
the CVAX CPU chip. Located in the CVAX CPU
chip is an integral primary write-through
1-kilobyte (KB) cache. The size of this cache can be
optionally expanded with a second-level cache
function on the CVAX bus. Consequently, the
CMCTL-to-CVAX bus interface had to work with
or without the optional second-level cache.
Furthermore, the primary cache and the optional
second-level cache use byte parity for memory
error detection. Therefore, the CMCTL bus
interface was required both to generate and to check
byte parity. For CVAX-based systems operating at
100-nanosecond (ns) and 60-ns CVAX bus cycles
and implementing a second-level cache, the
performance goals were respectively 2.5 and
4.0 times the performance of the MicroVAX II
system. These goals governed the CMCTL bus
memory performance, or memory cycle time,
requirements described later in this paper. Since
memory size requirements are proportional to
CPU chip performance, the CMCTL had to
support a memory size larger than that of the
MicroVAX II. The MicroVAX II CPU memory
systems have a byte-parity, memory error-detection
scheme. To meet the reliability requirements for
larger memory systems, the CMCTL was designed
primarily as an ECC memory controller.

A shorter version of this paper first appeared in the
Proceedings of the 1987 ICCD: VLSI in Computers and Processors,
October 1987, entitled "The CVAX CMCTL: A CMOS Memory
Controller Chip," by D. Morgan, K. Chui, J. Clonser,
S. Nadkarni, and R. Strouble. Copyright 1987, The Institute
of Electrical and Electronics Engineers, Inc.

Since a direct memory access (DMA) function
can also become the bus master on the CVAX bus,
the system requirements for the CMCTL were
influenced by these functions also. Because the
CVAX CPU chip performs only synchronous
transactions and a DMA function could be either
synchronous or asynchronous, the CMCTL is designed
to run as a synchronous or asynchronous slave on
the CVAX bus. Further, the CVAX CPU chip can
handle only two of the possible four types of data
transfer lengths on the CVAX bus. However, a
Q-bus DMA function (CQBIC) needed to
generate all four possible data transfer lengths in order
to efficiently handle data transfers between the
Q-bus and the CVAX bus, which have data widths
of 16 bits and 32 bits, respectively. The
requirement to work with the Q-bus DMA function meant
that the CMCTL needed to handle all four data
transfer lengths. In addition, since a DMA
function could optionally generate and check parity,
the CMCTL had to be flexible in this regard as
well. Finally, the CVAX CPU chip executes
interlocked instructions, which must have the effect of
"locking" or "unlocking" the memory from
DMA read-modify-write transactions. Interlocked
memory transactions are not defined in the Q-bus
protocol. Therefore, interlocked memory
transactions are handled with a bus interlock scheme.
In this scheme, the CQBIC stalls (that is, answers
with RETRY) the CVAX CPU chip memory read-lock
bus transaction on the CVAX bus until the CQBIC
becomes the Q-bus master, locking out I/O to
memory, before the CVAX can perform interlocked
instructions. RETRY is a slave response to a bus
master on the CVAX bus that tells it to retry the
bus cycle because the slave cannot complete the
requested operation. The CQBIC releases the
Q-bus after it sees a CVAX CPU chip memory
write transaction on the CVAX bus that signals the
termination of the interlocked instruction.
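
The bus interlock scheme can be illustrated with a toy model. Everything in it is an assumption made for illustration: the class, its methods, and the idea that the Q-bus becomes available after a fixed number of attempts are hypothetical, and real arbitration is not modeled.

```python
class InterlockModel:
    """Toy model of the CQBIC side of the bus interlock scheme."""

    def __init__(self, qbus_busy_attempts):
        # Number of read-lock attempts that find the Q-bus unavailable.
        self.qbus_busy_attempts = qbus_busy_attempts
        self.owns_qbus = False

    def read_lock(self):
        """Answer a CPU read-lock: RETRY until Q-bus mastership is held."""
        if not self.owns_qbus:
            if self.qbus_busy_attempts > 0:
                self.qbus_busy_attempts -= 1
                return "RETRY"
            self.owns_qbus = True   # I/O to memory is now locked out
        return "OK"

    def write_unlock(self):
        """The CPU write ends the interlock, so the Q-bus is released."""
        self.owns_qbus = False
        return "OK"


def interlocked_update(bridge, memory, address, update):
    """Perform a read-lock, modify, write-unlock sequence; return retry count."""
    retries = 0
    while bridge.read_lock() == "RETRY":
        retries += 1                # the CPU reissues the same bus cycle
    memory[address] = update(memory[address])
    bridge.write_unlock()
    return retries


memory = {0x100: 0}
bridge = InterlockModel(qbus_busy_attempts=3)
retry_count = interlocked_update(bridge, memory, 0x100, lambda v: v | 1)
```

The point of the sketch is the ordering: the memory location is not read until the read-lock stops returning RETRY, and mastership is given up only on the matching write.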

Certain base technology constraints influenced
the CMCTL specification. First, the
high-performance requirements for memory in a system that
does not implement a second-level cache
determined that the CMCTL be implemented in a
single custom chip. At the time, it was not possible
to implement a memory controller with the
required speed in a commercially available gate
array that would run synchronous with the CVAX
CPU chip. Furthermore, in a Q-bus-based system,
memory expansion occurs in the Q-bus
backplane. Therefore, a single memory controller that
resides on the CPU module and controls the
memory by means of signals on the backplane is
the simplest and most quickly implemented
system solution. Another factor that influenced the
single-chip alternative solution was the limited
space available on the CPU module that
implements a second-level cache. Taken together, these
factors ruled out the possibility of designing a
slower memory controller using commercially
available memory controller components for
systems that implement a second-level cache. The
availability of CMOS-1 technology in Digital's
Hudson, Massachusetts, facility in 1984 drove
the design technology choice.

System Overview 

The CVAX CMCTL is the core control function of
a single CVAX CPU memory system. This chip
serves as the interface between devices on the
CVAX bus and a CMOS private memory
interconnect (PMI). Figure 1 shows the major interfaces
of the CMCTL in a CVAX system, and Figure 2
shows the major sections of the chip. Table 1 lists
the physical characteristics of the chip.

Figure 1 Major Interface Connections of the CMCTL Chip

Figure 2 Photomicrograph of CMCTL Showing Major Sections

This section presents a brief overview of 
the CMCTL chip's two major interfaces, data 
transfer support, and error-checking and notifica- 
tion features. 

CMCTL Major Interfaces 
As an interface to the CVAX bus, the CMCTL responds
as either a synchronous or an asynchronous slave 
device. When the CVAX CPU chip is bus master, 
the CMCTL responses are synchronous. When a 
DMA device is bus master, a bus-mode signal 
determines whether the chip responds as a syn- 
chronous or asynchronous device. 



The CMCTL connects directly to its other major
interface, the PMI. The PMI consists of control,
address, and data signals which interconnect the
CMCTL and the memory array modules. Through
these interconnections, the chip controls up to
four memory modules, each containing one, two,
or four banks of dynamic random-access memory
(DRAM). Each memory module is required to
buffer all the PMI signals.

Table 1 CMCTL Summary Characteristics

Process                 2-micron drawn, N-well, dual
                        aluminum CMOS process
Number of transistors   20,000
Die size                7.6 mm x 8.0 mm
Power dissipation       1.5 W worst case
Packaging               132-pin surface-mountable chip
                        carrier with 25-mil lead spacing
Power supply            +5 V

Data Transfer 

The CMCTL fully supports the CVAX bus multi- 
ple-transfer protocol and can perform one to four 
data transfers on a memory read or write opera- 
tion. Each data transfer can have up to four bytes 
of data. Since ECC is generated across four bytes, 
write data with less than four valid bytes will 
cause the CMCTL to do the actual memory write 
on the PMI as a read-modify-write cycle. Other- 
wise, the write data goes directly to memory. 
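
The choice between a direct write and a read-modify-write cycle can be sketched in software. The function below is a hypothetical illustration, not the chip's data path: the 4-bit byte mask encoding and the list standing in for memory are assumptions, and the ECC check bits that the CMCTL would regenerate over the merged longword are not modeled.

```python
def write_longword(memory, index, data, byte_mask):
    """Write up to four bytes of `data` into memory[index].

    byte_mask is a 4-bit value; bit i set means byte i of `data` is valid.
    Returns the kind of memory cycle that was needed.
    """
    if byte_mask == 0b1111:
        # All four bytes valid: check bits can be computed from the write
        # data alone, so the longword goes straight to memory.
        memory[index] = data & 0xFFFFFFFF
        return "write"

    merged = memory[index]                       # read the existing longword
    for i in range(4):                           # modify only the valid bytes
        if byte_mask & (1 << i):
            lane = 0xFF << (8 * i)
            merged = (merged & ~lane) | (data & lane)
    memory[index] = merged & 0xFFFFFFFF          # write the merged longword
    return "read-modify-write"


memory = [0x11223344]
cycle = write_longword(memory, 0, 0xAABBCCDD, 0b0101)  # bytes 0 and 2 valid
```

With bytes 0 and 2 valid, the stored longword becomes 0x11BB33DD: the untouched bytes come from the old contents, which is why the read is needed before check bits can be generated.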

Error Checks and Notification 
The CMCTL performs two error-checking func- 
tions: 

■ CVAX bus data parity error checks 

■ Memory error checks 

To assist with the error checking of data transfers 
on the CVAX bus, the CMCTL checks data parity 
on memory writes. The chip generates parity 
with the data on memory reads. 
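
Byte parity over a 32-bit transfer can be sketched as below. The parity sense used on the CVAX bus is not stated here, so even parity is an assumption, and the function names are hypothetical.

```python
def byte_parity_bits(longword):
    """Return four parity bits, one per byte; bit i covers byte i.

    Assumption: even parity, so each bit is 1 when its byte holds an odd
    number of ones.
    """
    bits = 0
    for i in range(4):
        byte = (longword >> (8 * i)) & 0xFF
        parity = bin(byte).count("1") & 1
        bits |= parity << i
    return bits


def write_parity_ok(longword, parity_bits):
    """Check the parity that accompanies write data (memory write case)."""
    return byte_parity_bits(longword) == parity_bits


# On a read, the controller would send byte_parity_bits(data) with the data.
read_parity = byte_parity_bits(0x01030700)
```

For 0x01030700 the sketch yields 0b1010: bytes 1 and 3 each contain an odd number of ones, and bytes 0 and 2 an even number.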

For data transfers on the PMI, the CMCTL has 
two memory error-checking modes: 7-bit ECC, 
and single-bit parity. In ECC memory error mode, 
the CMCTL detects double-bit uncorrectable 
memory errors and detects and corrects single-bit 
memory errors. In parity memory error mode, the 
CMCTL can detect single-bit memory errors. 

The CMCTL uses four outputs to notify the 
CVAX bus master of four error conditions. These 
error-condition notices are as follows: 

■ The bus transaction was successful and com- 
pleted with no errors. 

■ The memory data transfer resulted in an uncor- 
rectable ECC or parity error. 

■ The memory data transfer resulted in a cor- 
rectable memory error. 

■ The CVAX CPU chip-initiated memory write 
had a parity error. 

In addition to these four outputs, the CMCTL pro- 
vides an output that indicates when the CMCTL is 
not going to respond to either a memory or an 
I/O operation. This output reduces the number 
of external components required to detect 
addresses not implemented in a system. 



CMCTL Performance 

The CMCTL achieves its performance in part by 
using a high-speed, page-mode RAM access pro- 
tocol on the PMI. DRAMs that run in page mode 
can perform data transfers in approximately one- 
half the cycle time of those run in nonpage mode. 

The CMCTL responds to CVAX single-transfer 
memory write or read operations within two or 
four CVAX bus cycles, respectively. During a 
memory read operation, the CMCTL starts a mem- 
ory read access in parallel with an optional cache 
to increase memory read performance. If the 
memory read address hits in the external cache, 
the CMCTL aborts the read operation. The 
CMCTL performs memory write transactions as 
dump-and-run. 

Table 2 lists the memory operations and the 
corresponding performance for synchronous data 
transfers with 4 bytes of data. Two numbers are 
shown for multiple-transfer memory operations. 
The first is the time in CVAX CPU bus cycles to 
complete the first transfer; the second, the time 
to complete subsequent transfers. In order to 
tune the memory performance across different 
CVAX bus speeds, the CMCTL provides a pro- 
grammable mechanism for varying PMI transac- 
tion timing. For CVAX bus cycle times less than 
100 ns, the CMCTL can be programmed to add 
slip cycles to memory read operations in incre- 
ments of the CVAX bus cycle time. The asyn- 
chronous performance of the CMCTL can be 
estimated by adding one bus cycle to the
synchronous data transfer numbers in Table 2.
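
As a worked example of reading Table 2, the sketch below converts cycle counts into elapsed time. The cycle counts are copied from Table 2; the function name is hypothetical, and applying the extra asynchronous cycle once per operation is an assumption, since the text does not say whether it applies per operation or per transfer.

```python
# (first transfer, each subsequent transfer) in CVAX bus cycles, from Table 2,
# keyed by operation and by CVAX bus cycle time in nanoseconds.
TABLE_2 = {
    "multiple read": {100: (4, 2), 90: (5, 3), 60: (6, 3)},
    "multiple write": {100: (3, 2), 90: (3, 2), 60: (3, 2)},
}


def transfer_time_ns(operation, cycle_ns, transfers, asynchronous=False):
    """Estimate the time for a synchronous (or asynchronous) operation."""
    first, subsequent = TABLE_2[operation][cycle_ns]
    cycles = first + (transfers - 1) * subsequent
    if asynchronous:
        cycles += 1  # assumption: one extra bus cycle per operation
    return cycles * cycle_ns


four_transfer_read_100 = transfer_time_ns("multiple read", 100, 4)  # 10 cycles
four_transfer_read_60 = transfer_time_ns("multiple read", 60, 4)    # 15 cycles
```

A four-transfer read takes 10 cycles (1000 ns) at a 100-ns bus cycle and 15 cycles (900 ns) at 60 ns, which shows how the added read cycles at faster bus speeds offset part of the shorter cycle time.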

The CMCTL memory read access time is very
important for systems that do not have a
second-level cache. For example, a 90-ns CVAX bus cycle
with a 5/3 CMCTL memory read access with a
second-level cache results in CPU performance
3.0 times that of the MicroVAX II. Without the
second-level cache, the CPU performance is
reduced by 15 percent, or to 2.5 times the
MicroVAX II. If the CMCTL memory read access
were fixed at 6/3 without the second-level cache,
the CPU performance would be reduced another
10 percent, or to 2.0 times the MicroVAX II, at a
90-ns CVAX bus cycle. Therefore, the ability to
program the CMCTL memory read access time as
an integral multiple of the CPU bus cycle is a very
important feature that helps maximize the CPU
performance.

Table 2 CVAX CMCTL Read and Write Performance
(in Numbers of Bus Cycles)

Memory Operation      CVAX Bus Cycles
(4 Bytes of Data)     100 ns    90 ns    60 ns

Single read           4         5        6
Multiple read         4/2       5/3      6/3
CPU single write      2         2        2
DMA single write      3         3        3
Multiple write        3/2       3/2      3/2

CMCTL Functions 

The CMCTL was designed to integrate both the 
control and data path functions required to con- 
trol the data flow to and from memory. 

Registers 

The CMCTL contains two registers: 

■ A status register 

■ A control register 

How each functions within the CMCTL and the 
system is described below. 

The status register is loaded with important 
information when the CMCTL detects an error. 
The system error-handling software uses this 
information to log the error. The CMCTL has a 
memory error status register that captures the 
failed memory address along with the type of 
memory error (bus parity error or memory error) 
and error syndrome. 

In ECC mode, the error syndrome is a 7-bit 
encoded number. For correctable errors, this 
number indicates which data bit was corrected. 
In parity mode, the error syndrome has no useful 
meaning. 

The chip's control register serves several func- 
tions. First, the control register regulates a diag- 
nostic test mode. Second, this register controls 
the PMI cycle tuning. Third, memory error
detection and correction can be turned on or off to
facilitate the testing of the CMCTL error-checking
functions and memory module RAMs by memory
diagnostic software. Finally, a refresh
operation can be forced for high-speed refresh testing.

Data Path 

In ECC error detection mode, the data path uses a
modified Hamming code to detect double-bit
errors and to detect and correct single-bit errors.
The PMI interface has 39 signals; 32 are used for
the memory data, and 7 are the memory check
bits. In parity error detection mode, the data path
uses single-bit parity to detect memory errors.
The data path transport delay for a memory read
or write is one-half the cycle time of the CVAX
bus. This performance measure includes
module-level interconnect delay.
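
To show how 7 check bits over 32 data bits allow single-bit correction and double-bit detection, here is a sketch of a textbook extended Hamming code. It is not the CMCTL's modified Hamming code, whose check-bit equations are not given here; the bit-position layout, function names, and status strings are all assumptions.

```python
CHECK_POSITIONS = [1, 2, 4, 8, 16, 32]   # Hamming check bits in a 38-bit word
CODE_LEN = 38
DATA_POSITIONS = [p for p in range(1, CODE_LEN + 1)
                  if p not in CHECK_POSITIONS]          # 32 data positions


def _data_syndrome(data):
    """XOR together the codeword positions of all set data bits."""
    syndrome = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            syndrome ^= pos
    return syndrome


def encode(data):
    """Return the 7 check bits: 6 Hamming bits plus overall parity in bit 6."""
    hamming = _data_syndrome(data)
    overall = (bin(data).count("1") + bin(hamming).count("1")) & 1
    return hamming | (overall << 6)


def decode(data, check):
    """Return (possibly corrected data, status string)."""
    hamming = check & 0x3F
    syndrome = hamming ^ _data_syndrome(data)
    ones = bin(data).count("1") + bin(check & 0x7F).count("1")
    parity_error = ones & 1               # 1 when an odd number of bits flipped
    if syndrome == 0 and not parity_error:
        return data, "no error"
    if parity_error and syndrome <= CODE_LEN:
        if syndrome in DATA_POSITIONS:    # otherwise a check bit was hit
            data ^= 1 << DATA_POSITIONS.index(syndrome)
        return data, "corrected"
    return data, "uncorrectable"


word = 0xDEADBEEF
check_bits = encode(word)
status = decode(word ^ (1 << 5), check_bits)   # single-bit error in the data
```

In this layout a nonzero syndrome with odd overall parity points at the failing bit, while a nonzero syndrome with even overall parity signals a double-bit error that cannot be corrected.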

Memory Control 

The PMI interface provides 20 signals. These sig- 
nals comprise all the control strobes and memory 
address signals needed to control DRAMs. A fast 
memory access time is achieved by detecting a 
valid memory address and starting a memory 
access within 25 percent of a CVAX bus cycle 
time. 

The CMCTL has an integral refresh counter for 
refreshing memory. 

Summary 

The CVAX CMCTL is the core control function of a 
complete memory subsystem. The chip provides 
the control for a flexible memory subsystem that 
functions at CVAX bus cycles from 60 to 100 ns.